=2005:Audio Genre=

==Proposer==<br />
<br />
Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk<br />
<br />
<br />
==Title==<br />
<br />
Genre Classification from polyphonic audio.<br />
<br />
<br />
==Description==<br />
<br />
The automatic classification of polyphonic musical audio (in PCM format) into a single high-level genre per example. If there is sufficient demand, a multiple genre track could be defined, requiring submissions to identify each genre (without prior knowledge of the number of labels), with the precision and recall scores calculated for each result.<br />
<br />
1) Input data<br />
The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.<br />
<br />
Audio format:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* Whole files; algorithms may use segments at the author's discretion<br />
<br />
Audio content:<br />
* polyphonic music<br />
* The data set should include at least 10 different genres. Suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Dancehall/Ragga, Ballroom Dance, Electronic/Modern Dance (Jungle, Drum and Bass, Techno, House), Classical and Folk. "World" music is excluded, as it is a common catch-all for ethnic/folk music that is not easily classified into another group and can contain music as diverse as Indian tabla and Celtic rock.<br />
* The final set of genres will be decided based on the data available<br />
* Genres will be organised hierarchically, on at least two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Rock/Pop), chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).<br />
* both live performances and sequenced music are eligible<br />
* Each class should be represented by a minimum of approximately 100 examples. It is NOT essential that each class be represented by the same number of examples.<br />
* A tuning database will NOT be provided. However, the Magnatune database used for the ISMIR 2004 Audio Description Contest is still available (Training part 1 [http://www.iua.upf.es/mtg/ismir2004/contest/Training_Tracks1.tar.gz], Training part 2 [http://www.iua.upf.es/mtg/ismir2004/contest/Training_Tracks2.tar.gz], Development part 1 [http://www.iua.upf.es/mtg/ismir2004/contest/Development_Tracks1.tar.gz], Development part 2 [http://www.iua.upf.es/mtg/ismir2004/contest/Development_Tracks2.tar.gz])<br />
<br />
<br />
Metadata:<br />
* By definition each example must have a genre label corresponding to one of the lowest-level output classes. (Upper-level labels will be interpolated by the evaluation software.)<br />
* Where possible, existing genre labels should be confirmed by two or more sources. Due to IP constraints it is unlikely that we will be allowed to distribute any database for metadata validation by participants. Viable sources for this metadata include CDDB, http://www.allmediaguide.com (http://www.allmusic.com), MP3.com or agreement by two or more human subjects.<br />
* The training set should be defined by a text file with one entry per line, in the following format (<> should be omitted, used here for clarity; an illustrative example is given below):<br><example path and filename>\t<bottom-level genre classification>\t<top-level genre classification>\n<br><br />
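For illustration only (the file paths below are hypothetical; the labels are taken from the example taxonomy above, and tabs are written as \t for clarity), two entries in this format might look like:<br>audio/blues/example001.wav\tBlues\tJazz/Blues<br>audio/rock/example002.wav\tRock\tRock/Pop<br><br />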
<br />
2) Output results<br />
<br />
* Results should be output into a text file with one entry per line in either of the following formats (<> should be omitted, used here for clarity):<br />
** <example path and filename>\t<lowest-level genre classification>\n<br>(Higher-level classifications will be interpolated by the evaluation framework)<br>'''or'''<br><br />
** <example path and filename>\t<bottom-level genre classification>\t<top-level genre classification>\n<br>(This example uses a 2-level hierarchy; the number of labels is limited to the height of the taxonomy)<br />
* The following optional tab-delimited descriptor format can be used by authors who wish to allow hybridisation of their submissions with other algorithms (including WEKA for classifier benchmarking)<br />
** Descriptors for each example should be contained in their own file, named according to the following format: originalFileName.wav.features<br />
** The file should be an ASCII text file in the following format (an illustrative example is given below):<br><columnLabel1>\t<columnLabel2>\t<columnLabel3>...etc<br>0.0\t0.0\t0.0...etc<br><br />
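For illustration only (the file name, path and feature labels below are hypothetical, and tabs are written as \t for clarity), a results line in the second format and the first two lines of the corresponding descriptor file example003.wav.features might look like:<br>audio/blues/example003.wav\tBlues\tJazz/Blues<br>spectralCentroidMean\tzeroCrossingRate\trmsEnergy<br>152.7\t0.042\t0.318<br><br />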
<br />
3) Maximum running time<br />
<br />
* The maximum running time for a single iteration of a submitted algorithm will be 24 hours (allowing a maximum of 72 hours for 3-fold cross-validation)<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, High<br />
* Elias Pampalk (ÖFAI), elias@oefai.at, High<br />
* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, High<br />
* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, CONFIRMED<br />
* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, CONFIRMED<br />
* Juan Jose Burred (Technical University of Berlin), burred@nue.tu-berlin.de, High<br />
* Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium<br />
* François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium<br />
* Beth Logan, HP, beth.logan@hp.com, Medium<br />
* Nicolas Scaringella (EPFL), nicolas.scaringella@epfl.ch, CONFIRMED<br />
* McKinney and Breebaart (Philips Research Labs), martin.mckinney@philips.com, jeroen.breebaart@philips.com<br />
* Gao Sheng and Kai Chen (Institute for Infocomm Research (A*STAR)), gaosheng@i2r.a-star.edu.sg, kchen@i2r.a-star.edu.sg<br />
* Enrique Alexandre & Manuel Rosa (University of Alcala, Spain), enrique.alexandre@uah.es, manuel.rosa@uah.es<br />
* Peter Ahrendt and Anders Meng (ISP, IMM, Technical University of Denmark), pa@imm.dtu.dk, CONFIRMED<br />
<br />
==Evaluation Procedures==<br />
3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.<br />
<br />
Evaluation measures:<br />
* 1 point will be scored for each correct label. I.e. for a two-level hierarchy, correctly assigning the labels Jazz/Blues and Blues to an example scores 2 points.<br />
* If only the lowest-level classification (in the hierarchical taxonomy) is returned, the higher-level classification will be interpolated. I.e. (in the previous example) correctly assigning the label Blues will score 2 points.<br />
* Simple accuracy and standard deviation of results (in the event of uneven class sizes, both will be normalised according to class size).<br />
* Test the significance of differences in the error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of p-values.<br />
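To make the hierarchical scoring concrete, here is a minimal sketch (illustrative only; the taxonomy, function name and label names are invented for this example and are not part of the evaluation framework) that interpolates the top-level label from the bottom-level prediction and awards one point per correct level:<br />
 # Hypothetical two-level taxonomy: bottom-level genre -> top-level genre
 TAXONOMY = {
     "Blues": "Jazz/Blues",
     "Jazz": "Jazz/Blues",
     "Rock": "Rock/Pop",
     "Pop": "Rock/Pop",
 }
 
 def score_example(predicted_bottom, true_bottom, taxonomy=TAXONOMY):
     # Return 0-2 points: one per correct level; only the bottom-level
     # prediction is required, the top level is interpolated from the taxonomy.
     points = 0
     if predicted_bottom == true_bottom:
         points += 1  # bottom level correct
     if taxonomy.get(predicted_bottom) == taxonomy.get(true_bottom):
         points += 1  # interpolated top level correct
     return points
 
 # Predicting "Blues" for a "Blues" example scores 2; "Jazz" scores 1; "Rock" scores 0.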
<br />
Evaluation framework:<br />
<br />
The competition framework will be defined in Data-2-Knowledge (D2K, http://alg.ncsa.uiuc.edu/do/tools/d2k) and will allow submission of contributions either in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k, first release due end of Feb 2005) or in Matlab, Python and C++ using the external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified under Input data (1) above and output results in the format described under Output results (2) above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results, and perform McNemar's testing of differences between the error rates of each system. An example framework could be made available in March for submission development.<br />
<br />
==Format for algorithm calls==<br />
<br />
There are four formats for calls to code external to D2K that will be supported:<br />
* CommandName inputFileNameAndPath outputFileNameAndPath<br />
* CommandName inputFileNameAndPath (output file name created by adding an extension, e.g. ".features")<br />
<br />
The second two formats allow an additional file to be passed as a parameter:<br />
* CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath<br />
* CommandName inputFileNameAndPath1 inputFileNameAndPath2 (output file name created by adding an extension to inputFileNameAndPath1, e.g. ".features")<br />
<br />
'''E.g.'''<br><br />
ExtractFeatures C:\inTrainFiles.txt C:\outTrainFeatures.feat<br><br />
ExtractFeatures C:\inTestFiles.txt C:\outTestFeatures.feat<br><br />
TrainModel C:\outTrainFeatures.feat<br><br />
ApplyModel C:\outTrainFeatures.feat.model C:\outTestFeatures.feat C:\results.txt<br><br />
<br />
==Relevant Test Collections==<br />
<br />
* Re-use the Magnatune database<br />
* Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with Creative Commons licenses)<br />
* Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)<br />
* Solicit contributions from http://creativecommons.org/audio/, http://www.epitonic.com, http://www.mp3.com/ (offers several free audio streams) and similar sites<br />
* Validate metadata through free services such as http://www.MP3.com, http://www.allmusic.com and CDDB<br />
<br />
Ground truth annotations:<br />
<br />
Rather than simply accepting supplied genre labels, all annotations should be validated by at least two sources, including non-participating volunteers (if possible). <br />
<br />
<br />
==Review 1==<br />
<br />
The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples. Both tests follow on from successful evaluations held at ISMIR2004; there was good interest and interesting results, and I think we can expect good participation in 2005.<br />
<br />
The tasks are well-defined, easily understood, and appear to have some practical importance. The evaluation and testing procedures are very good. This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.<br />
<br />
My only comments relate to the choice of ground truth data. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution. If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders. (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)<br />
<br />
My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.<br />
<br />
In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results. In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete. Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat. <br />
<br />
Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions. However, having each site run their own classifiers is a huge win in terms of the logistics of running the test. I would, however, discourage any scheme which involved releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set if the test set labels are just lying around.<br />
<br />
I'm not sure how important the M2K/D2K angle is. It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation. By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding. <br />
<br />
In terms of the genre contest, the big issue is the unreliability and unclear definitions of the ground truth labels. It seems weird to have one evaluation on the ability to distinguish an arbitrary set of artists - a very general-sounding problem - and another contest which is specifically dominated by the ability to distinguish classical from jazz from rock - a very specific, and perhaps not very important, problem. <br />
<br />
Again in this case I don't particularly like the idea of trying to get multiple labellings: for artists, I thought it was unnecessary because agreement will be very high. Here, I think it's of dubious value because agreement will be so low; in both cases, errors in ground truth impact all participants equally, and so are not really a concern - we are mostly interested in relative values, so a ceiling on absolute performance due to a few 'incorrect' reference labels is of little consequence. <br />
<br />
Clearly, we can run a genre contest: I would again advocate for real music, and not worry too much about copyright issues, and not even worry too much about where the genre ground truth comes from, since it is always pretty suspect; allmusic.com is as good a source as any. But I personally find this contest of less intellectual interest than artist ID, even though it has historically received more attention, because of the poor definition of the true, underlying classes. <br />
<br />
I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms. It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).<br />
<br />
==Review 2==<br />
<br />
The single-genre problem is well defined and seems to be a relevant problem for the MIR community nowadays. Obviously, it would be more relevant to classify each track into multiple genres or to use a hierarchy of genres, but the proposal does not deal with these issues in a satisfying way. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are nodes for Electronic and Jazz/Blues, where does the leaf Electro-jazz lie?<br />
I suggest that the contest concentrate on the well-defined single-genre problem. An interesting development of it would be to ask algorithms to associate a probability with each predefined genre on each track, instead of outputting a single genre with 100% probability.<br />
Regarding the input format, I think that whole files are better (the total duration and the volume variation are already good genre descriptors) and that polyphony is not required (classical music contains many works for solo instruments).<br />
<br />
I have no precise opinion regarding the defined genres, since this is more a matter of cultural importance. I'm not sure that Rock is less diverse than World (what is the common point between Elvis and Radiohead?). Also, I am surprised that there is no Rap/RnB.<br />
The choice of the genre classes is a crucial issue if the contest is to be held several times. Indeed, existing databases can be reused only when the defined categories are identical each year. Thus I would like this choice to be discussed more by the participants.<br />
<br />
The list of participants is relevant. McKinney and Breebaart could be added.<br />
<br />
It is a good idea to accept many programming languages for submission. However, it seems quite difficult to implement the learning phase, because each algorithm may use different structures to store learnt data. For instance, when the algorithm computes descriptors and feeds them through a classifier, is it possible to select the best descriptors? If not, it is not realistic to suppose that the participant has to do it beforehand on his own limited set of data. Then I see two possibilities: either participants are given 50% of the database and do all the learning work themselves (then no k-fold cross-validation is performed), or submissions concern only sets of descriptors and not full classification algorithms. The second choice has the advantage of allowing different sets of descriptors to be compared with the same classifiers.<br />
<br />
The test data are relevant but still a bit vague. Obviously, existing databases should be used again and completed with new annotated data. The participants should list their own databases in detail and pool them for evaluation, in order to estimate the time needed to annotate new data.<br />
<br />
==Downie's Comment==<br />
<br />
1. Think genre tasks are kinda fun, actually. Devil is in the details. Would give my eye teeth to avoid manually labelling genre classes. You set up eight classes with 100-150 examples. That comes to 800-1200 labels that need applying. Can we as a group come up with a possible standardized source for genre labels and then, even though they are not perfect, live with our choice? Perhaps in these early days, we would be best served by looking at only the broadest of categories and not fussing about the fine-grained subdivisions?<br />
<br />
2. Would be interesting to have a TRUE genre task! As we learned in the UPF doctoral seminar prior to ISMIR 2004, genre is properly defined as the "use" of the music: dance, liturgical, funereal, etc. What we are calling genre here is really style. Just a thought.<br />
<br />
==Kris' thoughts==<br />
<br />
Contents:<br />
# Multiple genres and Artist ID<br />
# Framework issues and algorithm submission<br />
# Producing ground truth and answers to Downie's comments<br />
# Who has data?<br />
----<br />
1. Multiple genres and Artist ID<br />
<br />
Dan Ellis wrote:<br />
<br />
> About multiple genre classification: I have pretty serious doubts<br><br />
> about genre classification in the first place, because of the<br><br />
> seemingly arbitrary nature of the classes and how they are assigned.<br><br />
<br />
IMHO the genre classification task is to reproduce an arbitrary set of culturally assigned classes. Tim Pohle (at the ISMIR 2004 grad school) gave an interesting talk on using genre classifiers to reproduce arbitrary, user assigned classes, to manage a user's personal music collection. We also discussed how to suggest new music choices from a larger catalog by thresholding the probabilities of membership of new music to favored classes.<br />
<br />
Dan Ellis wrote:<br />
<br />
> This is why I prefer artist identification as a task. That said,<br><br />
> assigning multiple genres seems not much worse, but not much better<br><br />
> either. Allowing for fuzzy, multiple characteristics seems to address<br><br />
> some of the problems with genres -- which is good -- but now defining<br><br />
> the ground truth is even more arguable and problematic, since we now<br><br />
> have that much more ground truth data to define -- degree of<br><br />
> membership, and over a larger set of classes.<br><br />
<br />
I also prefer the artist ID task, but for different reasons; I think we use too few classes to properly evaluate the genre classification task, as some models/techniques fall over if given too many classes to evaluate. Obviously this has come about because of storage, IP and ground-truth constraints. However, if a hierarchy is used (as suggested for the symbolic track), rather than a bag of labels, the ground-truth problem is no bigger, as higher-level labels can be interpolated, and it will be easier both to expand the database to include more pieces and to implement a finer granularity of labels (more sub-genres) in later evaluations. Small taxonomies are the biggest hurdle in the accurate evaluation of genre classification systems; we can probably define around 10 lowest-level classes for this year, but should aim to add the same number again next year and the year after, until we can be confident that we have a database that poses a classification problem as difficult as a real-world application (such as organizing/classifying the whole Epitonic catalog).<br />
<br />
Dan Ellis wrote:<br />
<br />
> One of the reasons I am interested in a parallel evaluation of genre<br><br />
> classification and artist ID is that it may provide some objective<br><br />
> evidence for my gut bias against genres: if the results of different<br><br />
> algorithms under genre classification are more equivocal than artist<br><br />
> ID (i.e. they reveal less about the difference between the<br><br />
> algorithms), then that's some kind of evidence that the task itself is<br><br />
> ill-posed. My suspicion is that multiple, real-valued genre<br><br />
> memberships will be even less revealing.<br />
<br />
I also believe that many classification techniques and feature extractions are vulnerable to smaller numbers of examples per class, and I think this is far more likely to show up in the comparison of an algorithm's performance between the two tracks (my own submission will be modified for the artist ID track). Artist identification is about modeling a natural grouping within the data, whereas genres are not necessarily natural groupings, and I believe the accurate modeling of hierarchical, multi-modal genres is likely to be more complex than modeling an artist's work (although this is alleviated by the additional data available). An artist may work in a number of styles, but there is *usually* some dimension along which all the examples are grouped.<br />
<br />
Dan Ellis wrote:<br />
<br />
> The most important thing, I think, is to define the evaluation to support <br><br />
>(and encourage) the largest number of participants, meaning that we could <br><br />
>include this as an option, but also evaluate a 1-best genre ID to remain <br><br />
>accessible to algorithms that intrinsically can only report one class.<br />
<br />
With this in mind I think we should opt for a hierarchical taxonomy, which can support direct comparison of hierarchical classifiers, single label classifiers (by interpolating higher level classifications in evaluation framework) and multiple label classifiers (in a somewhat limited fashion, perhaps with a penalty for additional incorrect labels, which is probably not fair, or by limiting number of labels to match height of taxonomy). I suggest that each correct label scores one point, e.g. rock/pop, rock, indie rock would score 3 if all labels are correct.<br />
<br />
----<br />
2. Framework issues and algorithm submission<br />
<br />
I don't think it is particularly ambitious to have people submit their code for evaluation at a single location. I have already implemented a basic D2K framework that can run anything that will run from the command line, including Matlab. The only constraint is that a submission will have to conform to a simple text-file input and output format. Marsyas-0.1 and Matlab examples have been produced and I am happy for people to take the IO portions directly from this code if they wish. Having code submitted to a central evaluation site will allow us to perform cross-validation experiments and assess exactly how much variance there is in each system's performance. This would not be so essential if we had a very large data set (min 10,000 examples); however, we are going to get nowhere near that many (maybe in later years...). It was also suggested in the reviews that this would hamstring feature selection techniques (see Review 2), but I don't believe this; surely the feature selection code (including any classifier used) would be correctly implemented in the feature extraction phase.<br />
<br />
I could also define an optional simple text-file format for descriptors. This would allow the hybridization of any submitted systems using this format and the use of a benchmark classifier to evaluate the power of the descriptors and classifiers independently of each other. I would be happy to provide several benchmark classifiers for this purpose (possibly by creating an interface to Weka). I would also be interested in seeing the performance of a mixture-of-experts system, built from all the submitted systems, which should, in theory, be able to equal or better the performance of all of the submitted systems.<br />
<br />
M2K is coming up to its alpha release and will include a cut-down version of the competition framework so that people can see how it works (external integration itinerary). As D2K can run across X Windows, we could even provide a virtual lab evaluation setup, so that each participant could run their own submission (without violating any IP laws) if they really wanted, and ensure that it ran OK. Anyone can get a license for D2K, and the framework will probably be included in a later version of M2K, so anyone can make sure that their submission works OK ahead of time.<br />
<br />
----<br />
3. Producing ground truth and answers to Downie's comments<br />
<br />
First, I don't think we need to send out a tuning database; it creates problems and solves none. If data is held at the evaluation site we don't have any IP issues and, as I stated earlier, anyone could launch their submission themselves in D2K across an X Windows session (note all console output from code external to D2K is collected and forwarded to the D2K console to aid debugging). If we use IP-free databases, we are unlikely to be able to validate ground-truth with online services such as http://www.allmediaguide.com/, and it has also been suggested that IP-free databases are not necessarily representative of the whole music community. Several people have said that it doesn't matter if some of the labels are incorrect; however, I'm not afraid to volunteer to validate the labels of a subset of the data (say 200 files; humans can get through them quicker than you'd think), and if there were sufficient volunteers it would go a long way towards establishing an IP-free research database with good ground-truth (if I don't get any volunteers I won't consider this an option, so email me!).<br />
<br />
Personally I think we should use a large volume of copyrighted material, with labels confirmed by at least two sources (existing database label, CDDB and allmediaguide, or a human labeler). The format should be WAV (MP3s would have to be decoded to this anyway) and will be mono unless anyone specifically requests stereo (both can be made available or can be handled by the framework).<br />
<br />
Should we rename this the Style classification task?<br />
----<br />
4. Who has data?<br />
Anyone with music (with or without ground-truth) that we can use should make themselves known ASAP. I can provide a fair selection of white-label (IP-free) dance music in at least 3 subgenres, with labels defined by 3 expert listeners.

=2005:Audio Onset Detect=

==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
Pierre Leveau (Laboratoire d'Acoustique Musicale, GET-ENST (Télécom Paris)) leveau at lam dot jussieu dot fr<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data are monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono) or stereo<br />
* file length between 8 and 15 seconds<br />
* File names: see ''Nomenclature'' below<br />
<br />
''Audio content'':<br />
<br />
The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes, ...). The performance of each algorithm will be assessed on the whole dataset but also on each class separately.<br />
<br />
The dataset contains 100 files from 5 classes annotated as follows:<br />
* 30 solo drum excerpts cross-annotated by 3 people<br />
* 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people<br />
* 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people<br />
* 15 complex mixes cross-annotated by 5 people<br />
* 15 complex mixes synthesized from MIDI<br />
<br />
''Nomenclature''<br />
<br />
<AudioFileName>.wav for the audio file<br />
<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>.output.<br />
<br />
<br />
''Onset file Format''<br />
<br />
<onset time(in seconds)>\n<br />
<br />
where \n denotes the end of line. The < and > characters are not included.<br />
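For example (the times are purely illustrative), an output file for an excerpt with three detected onsets would contain:<br />
0.532<br />
1.104<br />
1.873<br />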
<br />
<br />
3) '''Syntax'''<br />
<br />
Competitors will submit their algorithms together with some sets of parameters (e.g. thresholds, analysis frame size, ...) and the corresponding command syntax. Each set of parameters will be tested in the evaluation. The best-performing parameter set will be used for the final ranking; the other results will be displayed on a (CD, FP) plane (see Evaluation section).<br />
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau at lam dot jussieu dot fr><br />
Laurent Daudet <daudet at lam dot jussieu dot fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr>,<br />
Geoffroy Peeters <peeters@ircam.fr><br />
* Universitat Pompeu Fabra, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science,Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
* Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. For each detected onset time, if it falls within a tolerance time-window around a ground-truth onset, it is considered a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). Doubled onsets (two detections for one ground-truth onset) and merged onsets (one detection for two ground-truth onsets) will be taken into account in the evaluation.<br />
<br />
<br />
We thus define the '''FP rate''':<br />
''FP = 100. * (Ofp+Od) / Or''<br />
<br />
and the '''CD Rate''':<br />
''CD = 100. * (Or-Ofn-Om) / Or''<br />
<br />
<br />
with<br />
<br />
''Or'': number of correctly detected onsets<br />
<br />
''Ofn'': number of missed onsets<br />
<br />
''Om'': number of merged onsets<br />
<br />
''Ot'': number of ground-truth onsets<br />
<br />
''Ofp'': number of false positive onsets<br />
<br />
''Od'': number of double onsets<br />
<br />
<br />
Because files are cross-annotated, the mean CD and FP rates are defined by averaging CD and FP rates computed for each annotation.<br />
<br />
<br />
If an algorithm accepts parameters (e.g. the threshold, for those based on detection functions), it will be tuned to a limited number of working points on the ROC curve, e.g. one with a good correct-detection rate, another with a low false-positive rate, and a third between the two (up to 15 parameterizations can be submitted). These tunings will be considered as different versions of the same algorithm, and will be done before submission to the contest.<br />
<br />
<br />
To establish a ranking (and indicate a winner...), we will compute the error rate (inspired by Alexander Lerch's work):<br />
<br />
''q = (Ot - (Ofn + Ofp + Od + Om)) / (Or + (Ofn + Ofp + Od + Om))''<br />
<br />
This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of an error of each type (false positive or false negative) depends on the application following this task.<br />
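Taken literally, the rates and the criterion above can be computed from the six counts as in the following sketch (a plain transcription of the formulas given above, not reference code from the evaluation framework; the function and variable names are ours):<br />
 def onset_scores(o_r, o_fn, o_m, o_t, o_fp, o_d):
     # Counts as defined above: Or, Ofn, Om, Ot, Ofp, Od.
     fp_rate = 100.0 * (o_fp + o_d) / o_r                      # FP rate
     cd_rate = 100.0 * (o_r - o_fn - o_m) / o_r                # CD rate
     q = (o_t - (o_fn + o_fp + o_d + o_m)) / (o_r + (o_fn + o_fp + o_d + o_m))
     return fp_rate, cd_rate, q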
<br />
<br />
'''Evaluation measures:'''<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance of 50 ms or less). For certain files, we can't be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
<br />
'''More detailed data:'''<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
<br />
==Relevant Test Collections==<br />
<br />
Audio data are recordings made by the MTG at UPF Barcelona and excerpts from the RWC database. MIDI data are excerpts from the RWC database. Audio annotations were conducted by the Centre for Digital Music at QMUL, London (69% of annotations), the Musical Acoustics Lab at Paris 6 University (18%), the MTG at UPF Barcelona (11%) and the Analysis/Synthesis Group at IRCAM, Paris (2%). MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm ) was used for this purpose. Annotators were provided with an approximate aim (catching all onsets corresponding to music notes, including pitched onsets and not only percussive ones), but no further supervision of annotation was performed.<br />
<br />
The defined ground-truth can be critical for the evaluation. For MIDI-driven instruments, care should be taken to synchronize the MIDI clock and the audio recording clock. For real-world sounds, precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). It also means that the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are almost impossible to annotate precisely: legato bowed-string phrases, even more so with added reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread, and the annotation precision must be taken into account in the evaluation. <br />
<br />
Article about annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting if the proposal briefly discussed whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner, a single number should be extracted in the end. One possibility is to tune the algorithms to a single working point on the ROC curve, e.g. allowing a difference between FP and FN of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
It does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.<br />
<br />
The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose; I would suggest that the author reduce the length of the proposal and make it more assertive.<br />
There seems to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database, could be already of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how could you do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I will also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc) seems a logical choice, so I agree with the author on that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context on which they appear: onsets on pitched percussive music (e.g. piano and guitar music), onsets on pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets on non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low-level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable sub-set of this that we might run as we prepare for future MIREXes?

=2005:Audio and Symbolic Key=

==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this first step in the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es) [moderate]<br />
* Steffen Pauws (steffen.pauws@philips.com) [high]<br />
* Ozgur Izmirli (oizm@conncoll.edu) [moderate]<br />
* Yongwei Zhu (ywzhu@i2r.a-star.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg) [low]<br />
<br />
'''Symbolic Key-Finding''':<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi) [high]<br />
* Ming Li (mli@cmp.uea.ac.uk) and Ronan Sleep (mrs@cmp.uea.ac.uk) [high]<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* David Temperley (dtemp@theory.esm.rochester.edu) and Daniel Sleator (sleator@cs.cmu.edu) [high]<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi) [high]<br />
* Craig Sapp (craig@ccrma.stanford.edu) [moderate]<br />
<br />
==Evaluation Procedures==<br />
<br />
'''Test Set''': The test set we propose to use will consist of pieces for which the keys are known. For example, symphonies and concertos by well-known composers often have the key stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the one part of the piece where establishment of the global, known key can be guaranteed. Different excerpt durations will be considered: 30 seconds, 20 seconds and 10 seconds.<br />
<br />
'''Input/Output''': The input to the system should be some musical excerpt (either audio or MIDI) and the output should be a key name, for example C major or E flat minor. Only pitch class numbers will be taken into account during evaluation, for instance C sharp major and D flat major will be considered equivalent.<br />
<br />
'''System Calibration''': The test set will be randomly split into training and test data. Training data will be provided to the participants so that they determine the optimal settings for the parameters of their algorithms.<br />
<br />
'''Evaluation''': The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. The key of the piece is the one defined by the composer in the title of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered 'close' if they have one of the following relationships: distance of a perfect fifth, relative major and minor, or parallel major and minor. A correct key assignment will be given a full point, and incorrect assignments will be allocated fractions of a point according to the following table:<br />
<br />
{|<br />
|Relation to correct key ||Points<br />
|-<br />
|Same||1<br />
|-<br />
|Perfect fifth||0.5<br />
|-<br />
|Relative major/minor||0.3<br />
|-<br />
|Parallel major/minor||0.2<br />
|}<br />
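As an illustration of this scoring (a sketch only; the key encoding, pitch-class table and function name are our own and not part of the proposal), keys can first be reduced to a (pitch class, mode) pair, so that enharmonic spellings such as C sharp major and D flat major compare as equal, and the table above can then be applied:<br />
 # Map tonic spellings to pitch classes so enharmonic spellings compare equal.
 PITCH_CLASS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
                "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9,
                "A#": 10, "Bb": 10, "B": 11, "Cb": 11}
 
 def score_key(predicted, truth):
     # predicted and truth are (tonic, mode) pairs, e.g. ("Db", "major").
     p_pc, p_mode = PITCH_CLASS[predicted[0]], predicted[1]
     t_pc, t_mode = PITCH_CLASS[truth[0]], truth[1]
     if (p_pc, p_mode) == (t_pc, t_mode):
         return 1.0                              # same key
     if p_mode == t_mode and (p_pc - t_pc) % 12 in (5, 7):
         return 0.5                              # perfect fifth (either direction)
     if p_mode != t_mode:
         # relative minor lies 3 semitones below its relative major
         rel = (t_pc + 9) % 12 if t_mode == "major" else (t_pc + 3) % 12
         if p_pc == rel:
             return 0.3                          # relative major/minor
         if p_pc == t_pc:
             return 0.2                          # parallel major/minor
     return 0.0
 
 # score_key(("Db", "major"), ("C#", "major")) == 1.0  (enharmonic equivalence)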
<br />
'''Comments''': Many excellent suggestions were made in the review process. Some of the ideas included: using actual audio files from recordings for the audio portion of the contest, employing other metrics used in information retrieval literature, using test data from a wider variety of genres, and considering the detection of key modulations. <br />
<br />
As this is a first attempt at evaluating key-finding across different systems employing a variety of algorithm combinations, we have opted to keep the evaluation procedure as simple and streamlined as possible. The results of this contest will lay the groundwork from which we can expand the techniques for key-finding evaluation.<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Symbolic Data''': The dataset contains 500 classical music MIDI files selected from the Classical Music Archives (http://www.classicalarchives.com) and labelled with the key stated in their title.<br />
<br />
Examples of pieces include, but are not limited to, the following:<br />
<br />
Pieces from the Baroque period:<br />
Bach (http://www.classicalarchives.com/bach.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Vivaldi (http://www.classicalarchives.com/vivaldi.html) – Concerti and Chamber Works.<br />
Handel (http://www.classicalarchives.com/handel.html) – Orchestral Works, Keyboard Works, and Chamber Works.<br />
<br />
Pieces from the Classical period:<br />
Haydn (http://www.classicalarchives.com/haydn.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Mozart (http://www.classicalarchives.com/mozart.html) – Keyboard Works, Symphonies and Concertos, and Chamber Works.<br />
Early Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
<br />
Pieces from the Romantic period:<br />
Late Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
Brahms (http://www.classicalarchives.com/brahms.html) – Keyboard Works, Chamber Works, Concertos and Orchestral Works.<br />
Chopin (http://www.classicalarchives.com/chopin.html) – Piano Works.<br />
<br />
'''Audio Data''': The dataset contains the same pieces synthesized from MIDI to CD-quality (16-bit, 44100 Hz, mono) WAV files using various software MIDI synthesizers (Winamp, Cakewalk, etc.). The synthesizer for each piece was selected randomly.<br />
<br />
By using the same data for both the symbolic and audio key-finding methods, we will be able to evaluate and compare both approaches. It should be noted that even though synthesized MIDI is a simple alternative to actual audio, it is an appropriate approach for an evaluation where we are considering both audio and symbolic algorithms. Also, this controlled method eliminates possible tuning issues that are sometimes present in recorded audio.<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another for audio data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including audio data and its MIDI representation, or a MIDI representation and the audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
[Hendrik 02.26.05]: Key finding only makes sense for music of major/minor tonality. Some music is very clear in its tonal reference, e.g., Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and modern jazz. Other music has tonal centers but no major/minor tonality, e.g. raga or gamelan.<br />
So it could be useful to specify the realm of the challenge, the composers, epochs, or genres, e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 hits 1950-2005, and New Orleans to Bebop.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". <br />
<br />
[Chinghua 02.10.05]: Those relationships can be considered close to the main key, but they are still not the main key. If the algorithm gives those answers, it does achieve some points. So I suggest that we may give multiple levels of scores to the different answers. For example, the main key gets the whole point (maybe 5), the perfect fifth gets 75% or 80% of the whole point (maybe 3), and so on. <br />
<br />
<br />
What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442,...). Keys should be also considered as 'close' if they have a relationship of "1 semitone", to consider this difference between real key (according to its tuning) & labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
[Chinghua 02.10.05]: Since we will use MIDI synthesizer to generate the audio, the tuning won't be a serious problem. The detection algorithm should have the ability to regard both 440 and 442 Hz as pitch A. If the original piece is written in A Major but the arrangement of MIDI shifted a half step down to Ab Major, then the algorithm (both MIDI and Audio part) should detect it as Ab Major instead of A Major. <br />
<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but may be just a few pieces. Since different systems may need different amount of data for training, the participants need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of the data will not be guaranteed to be good for their training purpose. <br />
<br />
[ Perfe 02/24/05: I think that training data are a must. Training data should be a subset of the whole test set originally gathered. If train and test come from different populations then the estimations that we may get with the test will not be reliable; the goal of the train set is that of providing a reliable estimation of the expected performance with the test data].<br />
<br />
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (i.e. 15, 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed. <br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one who decided that the original proposal on key finding should be split into two proposals, one on audio key finding and one on symbolic key finding. Indeed, the audio and symbolic parts involve completely separate data and separate participants. From the committee's point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
[ Perfe 02/24/05: See my comment above. Rendering MIDI into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not fully extrapolate to audio-based music]<br />
<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.<br />
<br />
<br />
==Emilia's Comments==<br />
<br />
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some reviewers on several issues, which I would like to comment on:<br />
<br />
1.- I think it is important to provide some training data so that participants can evaluate their algorithms on material matching the evaluation material: genres, audio format, etc. I think this can also be useful to check that each algorithm works within the evaluation environment. If participants provide the output of their algorithm on this training data, it can serve as a way to verify that the algorithm performs correctly on the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids some problems when running algorithms on different systems/platforms, languages, etc.<br />
<br />
2.- It is important to establish some kind of rules for submission: binaries, matlab code, java???. Is it possible to submit different versions of the algorithm for the same participant? <br />
<br />
[Hendrik 02.26.05]: matlab would be very convenient. <br />
<br />
3.- I think that the use of audio synthesized from MIDI would be a simplistic solution that is not representative of the complexity of the problem. Maybe we could try to find MIDI plus real performances, or have some of the evaluation material synthesized from MIDI but not all of it. Then, I agree with reviewer 2 that tuning errors should be considered as close tonalities.<br />
<br />
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.<br />
<br />
5.- I would propose to contact Marc Leman and his group, they have done a lot of work on perception based music analysis and they may be interested in participating: Marc.Leman@UGent.be. They have also a lot of experience in manual annotation. <br />
<br />
Best regards and thanks,</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_and_Symbolic_Key&diff=8012005:Audio and Symbolic Key2005-04-05T15:58:35Z<p>138.37.33.58: /* Relevant Test Collections */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this first step in the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es) [moderate]<br />
* Steffen Pauws (steffen.pauws@philips.com) [high]<br />
* Ozgur Izmirli (oizm@conncoll.edu) [moderate]<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg) [low]<br />
<br />
'''Symbolic Key-Finding''':<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi) [high]<br />
* Ming Li (mli@cmp.uea.ac.uk) and Ronan Sleep (mrs@cmp.uea.ac.uk) [high]<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* David Temperley (dtemp@theory.esm.rochester.edu) and Daniel Sleator (sleator@cs.cmu.edu) [high]<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi) [high]<br />
* Craig Sapp (craig@ccrma.stanford.edu) [moderate]<br />
<br />
==Evaluation Procedures==<br />
<br />
'''Test Set''': The test set we propose to use will consist of pieces for which the keys are known. For example, symphonies and concertos by well-known composers often have the key stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the part of the piece where the establishment of the stated global key can be guaranteed. Different excerpt durations will be considered: 30 seconds, 20 seconds and 10 seconds.<br />
<br />
'''Input/Output''': The input to the system should be some musical excerpt (either audio or MIDI) and the output should be a key name, for example C major or E flat minor. Only pitch class numbers will be taken into account during evaluation, for instance C sharp major and D flat major will be considered equivalent.<br />
<br />
'''System Calibration''': The test set will be randomly split into training and test data. Training data will be provided to the participants so that they can determine the optimal settings for the parameters of their algorithms.<br />
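<br />
To make the calibration step concrete, the following is a minimal Python sketch of one way the random split could be realized. The 30% training fraction, the fixed seed and the function name are illustrative assumptions; the proposal only states that the split is random.<br />
<pre>
import random

def split_dataset(pieces, train_fraction=0.3, seed=2005):
    """Hypothetical sketch of the random split into training and test data.
    The train fraction and the fixed seed are assumptions; the proposal
    only states that the split is random."""
    rng = random.Random(seed)
    shuffled = list(pieces)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
</pre>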
<br />
'''Evaluation ''': The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. The key of the piece is the one defined by the composer in the title of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered 'close' if they have one of the following relationships: distance of a perfect fifth, relative major and minor, and parallel major and minor. A correct key assignment will be given a full point, and incorrect assignments will be allocated fractions of a point according to the following table:<br />
<br />
{|<br />
|Relation to correct key ||Points<br />
|-<br />
|Same||1<br />
|-<br />
|Perfect fifth||0.5<br />
|-<br />
|Relative major/minor||0.3<br />
|-<br />
|Parallel major/minor||0.2<br />
|}<br />
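<br />
For illustration, here is a minimal Python sketch of one way the scoring rubric above could be implemented. The key-name format ('C major', 'E flat minor', ...), the parsing helper, and the assumption that the perfect-fifth relation applies between keys of the same mode are ours; the proposal itself only fixes the point values and the pitch-class equivalence of enharmonic spellings.<br />
<pre>
# Illustrative sketch only, not part of the proposal.
PITCH_CLASSES = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def parse_key(name):
    """Map a key name such as 'C major', 'E flat minor' or 'C sharp major'
    to a (pitch_class, mode) pair; enharmonic spellings such as C sharp and
    D flat collapse to the same pitch class, as required by the proposal."""
    tokens = name.lower().split()
    pc = PITCH_CLASSES[tokens[0].upper()]
    if 'sharp' in tokens:
        pc = (pc + 1) % 12
    if 'flat' in tokens:
        pc = (pc - 1) % 12
    return pc, tokens[-1]          # mode is 'major' or 'minor'

def score(estimated, correct):
    """Return the points awarded for an estimated key against the correct key,
    following the table above. Requiring the same mode for the perfect-fifth
    relation is an assumption."""
    est_pc, est_mode = parse_key(estimated)
    cor_pc, cor_mode = parse_key(correct)
    if (est_pc, est_mode) == (cor_pc, cor_mode):
        return 1.0                                           # same key
    if est_mode == cor_mode and (est_pc - cor_pc) % 12 in (5, 7):
        return 0.5                                           # perfect fifth apart
    relative = (cor_pc + 9) % 12 if cor_mode == 'major' else (cor_pc + 3) % 12
    if est_pc == relative and est_mode != cor_mode:
        return 0.3                                           # relative major/minor
    if est_pc == cor_pc and est_mode != cor_mode:
        return 0.2                                           # parallel major/minor
    return 0.0

# Example: answering 'G major' for a piece in C major would score 0.5.
</pre>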
<br />
'''Comments''': Many excellent suggestions were made in the review process. Some of the ideas included: using actual audio files from recordings for the audio portion of the contest, employing other metrics used in information retrieval literature, using test data from a wider variety of genres, and considering the detection of key modulations. <br />
<br />
As this is a first attempt at evaluating key-finding across different systems employing a variety of algorithm combinations, we have opted to keep the evaluation procedure as simple and streamlined as possible. The results of this contest will lay the groundwork from which we can expand the techniques for key-finding evaluation.<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Symbolic Data''': The dataset contains 500 classical music MIDI files selected from the Classical Music Archives (http://www.classicalarchives.com) and labelled with the key stated in their title.<br />
<br />
Examples of pieces include, but are not limited to, the following:<br />
<br />
Pieces from the Baroque period:<br />
Bach (http://www.classicalarchives.com/bach.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Vivaldi (http://www.classicalarchives.com/vivaldi.html) – Concerti and Chamber Works.<br />
<br />
Pieces from the Classical period:<br />
Handel (http://www.classicalarchives.com/handel.html) – Orchestral Works, Keyboard Works, and Chamber Works.<br />
Haydn (http://www.classicalarchives.com/haydn.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Mozart (http://www.classicalarchives.com/mozart.html) – Keyboard Works, Symphonies and Concertos, and Chamber Works.<br />
Early Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
<br />
Pieces from the Romantic period:<br />
Late Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
Brahms (http://www.classicalarchives.com/brahms.html) – Keyboard Works, Chamber Works, Concertos and Orchestral Works.<br />
Chopin (http://www.classicalarchives.com/chopin.html) – Piano Works.<br />
<br />
'''Audio Data''': The dataset contains the same pieces synthesized from MIDI to CD-quality (16-bit, 44100 Hz, mono) WAV files using various software MIDI synthesizers (Winamp, Cakewalk, etc.). The synthesizer for each piece will be selected randomly.<br />
<br />
By using the same data for both the symbolic and audio key-finding methods, we will be able to evaluate and compare both approaches. It should be noted that even though synthesized MIDI is a simple alternative to actual audio, it is an appropriate approach for an evaluation where we are considering both audio and symbolic algorithms. Also, this controlled method eliminates possible tuning issues that are sometimes present in recorded audio.<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
[Hendrik 02.26.05]: Key finding only makes sense for music of major/minor tonality. Some music is very clear in its tonal reference, e.g., Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and Modern Jazz. Other music has tonal centers but no major/minor tonality, e.g. Raga or Gamelan.<br />
So it could be useful to specify the realm of the challenge, the composers, epochs, or genres, e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 Hits 1950-2005, and New Orleans to Bebop.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". <br />
<br />
[Chinghua 02.10.05]: Those relationships can be considered keys close to the main key, but they are still not the main key. However, if the algorithm gives those answers, it does achieve some points. So I suggest that we may give multiple levels of scores to the different answers. For example, the main key gets the full points (maybe 5), the perfect fifth gets 75% or 80% of the full points (maybe 3), and so on. <br />
<br />
<br />
What about tuning errors? In the case of audio, different tuning systems can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442, ...). Keys should also be considered 'close' if they have a relationship of "1 semitone", to account for this difference between the real key (according to its tuning) & the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
[Chinghua 02.10.05]: Since we will use MIDI synthesizer to generate the audio, the tuning won't be a serious problem. The detection algorithm should have the ability to regard both 440 and 442 Hz as pitch A. If the original piece is written in A Major but the arrangement of MIDI shifted a half step down to Ab Major, then the algorithm (both MIDI and Audio part) should detect it as Ab Major instead of A Major. <br />
<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but it may be just a few pieces. Since different systems may need different amounts of data for training, the participants need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of the data are not guaranteed to be suitable for their training purposes. <br />
<br />
[ Perfe 02/24/05: I think that training data are a must. The training data should be a subset of the whole test set originally gathered. If the training and test data come from different populations, then the estimates that we get from the test will not be reliable; the goal of the training set is to provide a reliable estimate of the expected performance on the test data].<br />
<br />
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (e.g., 15 or 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior ?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed. <br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one who decided that the original proposal on key finding should be split into two proposals, one on audio key finding and one on symbolic key finding. Indeed, the audio and symbolic parts involve completely separate data and separate participants. From the committee's point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
[ Perfe 02/24/05: See my comment above. Rendering MIDI into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not fully extrapolate to audio-based music]<br />
<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.<br />
<br />
<br />
==Emilia's Comments==<br />
<br />
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some reviewers on several issues, which I would like to comment on:<br />
<br />
1.- I think it is important to provide some training data so that participants can evaluate their algorithms on material matching the evaluation material: genres, audio format, etc. I think this can also be useful to check that each algorithm works within the evaluation environment. If participants provide the output of their algorithm on this training data, it can serve as a way to verify that the algorithm performs correctly on the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids some problems when running algorithms on different systems/platforms, languages, etc.<br />
<br />
2.- It is important to establish some kind of rules for submission: binaries, matlab code, java???. Is it possible to submit different versions of the algorithm for the same participant? <br />
<br />
[Hendrik 02.26.05]: matlab would be very convenient. <br />
<br />
3.- I think that the use of audio synthesized from MIDI would be a simplistic solution that is not representative of the complexity of the problem. Maybe we could try to find MIDI plus real performances, or have some of the evaluation material synthesized from MIDI but not all of it. Then, I agree with reviewer 2 that tuning errors should be considered as close tonalities.<br />
<br />
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.<br />
<br />
5.- I would propose to contact Marc Leman and his group, they have done a lot of work on perception based music analysis and they may be interested in participating: Marc.Leman@UGent.be. They have also a lot of experience in manual annotation. <br />
<br />
Best regards and thanks,</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_and_Symbolic_Key&diff=8002005:Audio and Symbolic Key2005-04-05T15:46:42Z<p>138.37.33.58: /* Evaluation Procedures */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this first step in the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es) [moderate]<br />
* Steffen Pauws (steffen.pauws@philips.com) [high]<br />
* Ozgur Izmirli (oizm@conncoll.edu) [moderate]<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg) [low]<br />
<br />
'''Symbolic Key-Finding''':<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi) [high]<br />
* Ming Li (mli@cmp.uea.ac.uk) and Ronan Sleep (mrs@cmp.uea.ac.uk) [high]<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* David Temperley (dtemp@theory.esm.rochester.edu) and Daniel Sleator (sleator@cs.cmu.edu) [high]<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi) [high]<br />
* Craig Sapp (craig@ccrma.stanford.edu) [moderate]<br />
<br />
==Evaluation Procedures==<br />
<br />
'''Test Set''': The test set we propose to use will consist of pieces for which the keys are known. For example, symphonies and concertos by well-known composers often have the key stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the part of the piece where the establishment of the stated global key can be guaranteed. Different excerpt durations will be considered: 30 seconds, 20 seconds and 10 seconds.<br />
<br />
'''Input/Output''': The input to the system should be some musical excerpt (either audio or MIDI) and the output should be a key name, for example C major or E flat minor. Only pitch class numbers will be taken into account during evaluation, for instance C sharp major and D flat major will be considered equivalent.<br />
<br />
'''System Calibration''': The test set will be randomly split into training and test data. Training data will be provided to the participants so that they can determine the optimal settings for the parameters of their algorithms.<br />
<br />
'''Evaluation ''': The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. The key of the piece is the one defined by the composer in the title of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered 'close' if they have one of the following relationships: distance of a perfect fifth, relative major and minor, and parallel major and minor. A correct key assignment will be given a full point, and incorrect assignments will be allocated fractions of a point according to the following table:<br />
<br />
{|<br />
|Relation to correct key ||Points<br />
|-<br />
|Same||1<br />
|-<br />
|Perfect fifth||0.5<br />
|-<br />
|Relative major/minor||0.3<br />
|-<br />
|Parallel major/minor||0.2<br />
|}<br />
<br />
'''Comments''': Many excellent suggestions were made in the review process. Some of the ideas included: using actual audio files from recordings for the audio portion of the contest, employing other metrics used in information retrieval literature, using test data from a wider variety of genres, and considering the detection of key modulations. <br />
<br />
As this is a first attempt at evaluating key-finding across different systems employing a variety of algorithm combinations, we have opted to keep the evaluation procedure as simple and streamlined as possible. The results of this contest will lay the groundwork from which we can expand the techniques for key-finding evaluation.<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Symbolic Data''':<br />
MIDI Collections: MIDI is an event-based representation of music; it provides a numeric encoding of the pitch, onset/offset time and velocity of every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files by more than two thousand composers in MIDI format. All files are presented with the full title and composer, and most of the files state the key clearly. Music by different composers may be used to test the range of each algorithm. <br />
<br />
'''Audio Data''':<br />
Synthesized MIDI: Audio data can be generated by synthesizing the MIDI data proposed above. By using the same data for both the symbolic and audio key-finding methods, we will be able to evaluate and compare both approaches. It should be noted that even though synthesized MIDI is a simple alternative to actual audio, it is an appropriate approach for an evaluation where we are considering both audio and symbolic algorithms. Also, this controlled method eliminates possible tuning issues that are sometimes present in recorded audio.<br />
<br />
Audio-from-MIDI data can be synthesized using either software or hardware. The software synthesizers include freeware such as Winamp and commercial software such as Cakewalk. The hardware synthesizers, for instance, a Roland XV5080, can receive MIDI commands and use built-in synthesizers to produce more realistic sound.<br />
<br />
'''Test Data''':<br />
The test data can be obtained from the Classical Archives website (http://www.classicalarchives.com). This site provides a large collection of classical music. Examples of pieces with labeled keys appropriate for the test data set include, but are not limited to, the following:<br />
<br />
Pieces from the Baroque period:<br />
Bach (http://www.classicalarchives.com/bach.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Vivaldi (http://www.classicalarchives.com/vivaldi.html) – Concerti and Chamber Works.<br />
<br />
Pieces from the Classical period:<br />
Handel (http://www.classicalarchives.com/handel.html) – Orchestral Works, Keyboard Works, and Chamber Works.<br />
Haydn (http://www.classicalarchives.com/haydn.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Mozart (http://www.classicalarchives.com/mozart.html) – Keyboard Works, Symphonies and Concertos, and Chamber Works.<br />
Early Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
<br />
Pieces from the Romantic period:<br />
Late Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
Brahms (http://www.classicalarchives.com/brahms.html) – Keyboard Works, Chamber Works, Concertos and Orchestral Works.<br />
Chopin (http://www.classicalarchives.com/chopin.html) – Piano Works.<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
[Hendrik 02.26.05]: Key finding only makes sense for music of major/minor tonality. Some music is very clear in its tonal reference, e.g., Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and Modern Jazz. Other music has tonal centers but no major/minor tonality, e.g. Raga or Gamelan.<br />
So it could be useful to specify the realm of the challenge, the composers, epochs, or genres, e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 Hits 1950-2005, and New Orleans to Bebop.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". <br />
<br />
[Chinghua 02.10.05]: Those relationships can be considered keys close to the main key, but they are still not the main key. However, if the algorithm gives those answers, it does achieve some points. So I suggest that we may give multiple levels of scores to the different answers. For example, the main key gets the full points (maybe 5), the perfect fifth gets 75% or 80% of the full points (maybe 3), and so on. <br />
<br />
<br />
What about tuning errors? In the case of audio, different tuning systems can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442, ...). Keys should also be considered 'close' if they have a relationship of "1 semitone", to account for this difference between the real key (according to its tuning) & the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
[Chinghua 02.10.05]: Since we will use MIDI synthesizer to generate the audio, the tuning won't be a serious problem. The detection algorithm should have the ability to regard both 440 and 442 Hz as pitch A. If the original piece is written in A Major but the arrangement of MIDI shifted a half step down to Ab Major, then the algorithm (both MIDI and Audio part) should detect it as Ab Major instead of A Major. <br />
<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but it may be just a few pieces. Since different systems may need different amounts of data for training, the participants need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of the data are not guaranteed to be suitable for their training purposes. <br />
<br />
[ Perfe 02/24/05: I think that training data are a must. The training data should be a subset of the whole test set originally gathered. If the training and test data come from different populations, then the estimates that we get from the test will not be reliable; the goal of the training set is to provide a reliable estimate of the expected performance on the test data].<br />
<br />
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (e.g., 15 or 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior ?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed. <br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one who decided that the original proposal on key finding should be split into two proposals, one on audio key finding and one on symbolic key finding. Indeed, the audio and symbolic parts involve completely separate data and separate participants. From the committee's point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
[ Perfe 02/24/05: See my comment above. Rendering MIDI into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not fully extrapolate to audio-based music]<br />
<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.<br />
<br />
<br />
==Emilia's Comments==<br />
<br />
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some reviewers on several issues, which I would like to comment on:<br />
<br />
1.- I think it is important to provide some training data so that participants can evaluate their algorithms on material matching the evaluation material: genres, audio format, etc. I think this can also be useful to check that each algorithm works within the evaluation environment. If participants provide the output of their algorithm on this training data, it can serve as a way to verify that the algorithm performs correctly on the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids some problems when running algorithms on different systems/platforms, languages, etc.<br />
<br />
2.- It is important to establish some kind of rules for submission: binaries, matlab code, java???. Is it possible to submit different versions of the algorithm for the same participant? <br />
<br />
[Hendrik 02.26.05]: matlab would be very convenient. <br />
<br />
3.- I think that the use of audio synthesized from MIDI would be a simplistic solution that is not representative of the complexity of the problem. Maybe we could try to find MIDI plus real performances, or have some of the evaluation material synthesized from MIDI but not all of it. Then, I agree with reviewer 2 that tuning errors should be considered as close tonalities.<br />
<br />
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.<br />
<br />
5.- I would propose to contact Marc Leman and his group, they have done a lot of work on perception based music analysis and they may be interested in participating: Marc.Leman@UGent.be. They have also a lot of experience in manual annotation. <br />
<br />
Best regards and thanks,</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3982005:Audio Onset Detect2005-03-24T13:27:11Z<p>138.37.33.58: /* Evaluation Procedures */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data are monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* file length between 8 and 15 seconds<br />
<br />
''Audio content'':<br />
<br />
The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (ex: segmentation of a single track in a mix, drum transcription, complex mixes databases segmentation...). The performance of each algorithm will be assessed on the whole dataset but also on each class separately.<br />
<br />
The dataset contains 100 files from 5 classes annotated as follows:<br />
* 30 solo drum excerpts cross-annotated by 3 people<br />
* 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people<br />
* 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people<br />
* 15 complex mixes cross-annotated by 5 people<br />
* 15 complex mixes synthesized from MIDI<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
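<br />
As an illustration of this output convention, here is a small Python sketch that writes detected onset times to the expected file name. The one-onset-per-line format and the use of seconds are assumptions on our part; the proposal only specifies the naming scheme.<br />
<pre>
import os

def write_onsets(audio_filename, onset_times, results_dir):
    """Hypothetical helper: write detected onset times (in seconds) to
    <results_dir>/<AudioFileName>_onsets.txt, one onset per line.
    The one-value-per-line layout is an assumption; the proposal only
    fixes the file naming scheme."""
    base = os.path.splitext(os.path.basename(audio_filename))[0]
    out_path = os.path.join(results_dir, base + '_onsets.txt')
    with open(out_path, 'w') as f:
        for t in onset_times:
            f.write('%.4f\n' % t)
    return out_path

# Example usage:
# write_onsets('drums01.wav', [0.012, 0.487, 0.965], 'results/myalgo')
</pre>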
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr>,<br />
Geoffroy Peeters <peeters@ircam.fr><br />
* Universitat Pompeu Fabra, Multimedia Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science, Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
*Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. A detected onset time is considered a '''correct detection''' (CD) if it falls within a tolerance time-window around a ground-truth onset; otherwise, it is a '''false positive''' (FP). Because files are cross-annotated, the mean CD and FP rates are defined by averaging the CD and FP rates computed for each annotation.<br />
<br />
The algorithms based on detection functions will be tuned to a limited number of working points of the ROC curve, e.g. one with a good correct detection rate, another one with a weak false positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before submission to the contest.<br />
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the (100, 0) point. This criterion is arbitrary, but it gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
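<br />
To make this procedure concrete, here is a rough Python sketch of how the CD and FP rates, their average over cross-annotations, and the distance to the ideal (100, 0) point might be computed. The greedy one-to-one matching of detections to ground-truth onsets, the 50 ms default tolerance and the exact normalization of the rates are assumptions on our part.<br />
<pre>
import math

def evaluate_annotation(detected, ground_truth, tolerance=0.05):
    """Count correct detections (CD) and false positives (FP) for one
    annotation, using a symmetric tolerance window (50 ms by default).
    Greedy one-to-one matching is an assumption; the proposal does not
    state how doubled detections are paired with ground-truth onsets."""
    unmatched = sorted(ground_truth)
    cd = 0
    for t in sorted(detected):
        match = next((g for g in unmatched if abs(g - t) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            cd += 1
    fp = len(detected) - cd
    cd_rate = 100.0 * cd / len(ground_truth) if ground_truth else 0.0
    fp_rate = 100.0 * fp / len(detected) if detected else 0.0
    return cd_rate, fp_rate

def evaluate_file(detected, annotations, tolerance=0.05):
    """Average the CD and FP rates over all cross-annotations of a file,
    and report the Euclidean distance to the ideal (100, 0) point."""
    rates = [evaluate_annotation(detected, a, tolerance) for a in annotations]
    cd_rate = sum(r[0] for r in rates) / len(rates)
    fp_rate = sum(r[1] for r in rates) / len(rates)
    distance = math.hypot(100.0 - cd_rate, fp_rate)
    return cd_rate, fp_rate, distance
</pre>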
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance from 50 ms downwards). For certain files, we can't be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
<br />
==Relevant Test Collections==<br />
<br />
Audio data are recordings made by the MTG at UPF Barcelona and excerpts from the RWC database. MIDI data are excerpts from the RWC database. Annotations were conducted by the Centre for Digital Music at QMUL London, the MTG at UPF Barcelona and the Musical Acoustics Lab at Paris 6 University. MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm) was used for this purpose. Annotators were provided with an approximate aim (catching all onsets corresponding to music notes, including pitched onsets and not only percussive ones), but no further supervision of the annotation was performed.<br />
<br />
The defined ground truth can be critical for the evaluation. For the MIDI-commanded instruments, care should be taken to synchronize the MIDI clock and the audio recording clock. For real-world sounds, precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). This also means that the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are almost impossible to annotate precisely: legato bowed string phrases, even more so with added reverb. Slightly broken chords also introduce ambiguities about the number of onsets to mark. In these cases the annotations can be spread, and the annotation precision must be taken into account in the evaluation. <br />
<br />
Article about annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting if the proposal briefly discussed whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner a single number should be finally extracted. A possibility to do so is tuning the algorithms to a single working point on the ROC curve, e.g. say allow a difference between FP and FN of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
It does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.<br />
<br />
The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose, I will suggest that the author reduces the length of the proposal and makes it more assertive.<br />
There seems to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database, could be already of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how could you do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I will also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc) seems a logical choice, so I agree with the author on that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context on which they appear: onsets on pitched percussive music (e.g. piano and guitar music), onsets on pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets on non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable sub-set of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3972005:Audio Onset Detect2005-03-24T13:21:02Z<p>138.37.33.58: /* Relevant Test Collections */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data are monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* file length between 8 and 15 seconds<br />
<br />
''Audio content'':<br />
<br />
The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (ex: segmentation of a single track in a mix, drum transcription, complex mixes databases segmentation...). The performance of each algorithm will be assessed on the whole dataset but also on each class separately.<br />
<br />
The dataset contains 100 files from 5 classes annotated as follows:<br />
* 30 solo drum excerpts cross-annotated by 3 people<br />
* 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people<br />
* 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people<br />
* 15 complex mixes cross-annotated by 5 people<br />
* 15 complex mixes synthesized from MIDI<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
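A minimal sketch of a writer for this convention (the one-onset-per-line layout and the helper name are assumptions; only the file naming is fixed by the proposal):<br />
<pre>
import os

def write_onsets(results_dir, audio_filename, onset_times):
    """Write detected onset times (in seconds), one per line, to
    results_dir/AudioFileName_onsets.txt."""
    base = os.path.splitext(os.path.basename(audio_filename))[0]
    out_path = os.path.join(results_dir, base + "_onsets.txt")
    with open(out_path, "w") as f:
        for t in onset_times:
            f.write("%.6f\n" % t)
    return out_path

# e.g. write_onsets("results/my_algo", "excerpt01.wav", [0.12, 0.58, 1.03])
</pre>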
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr>,<br />
Geoffroy Peeters <peeters@ircam.fr><br />
* Pompeu Fabra University, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science, Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
* Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. If a detected onset time falls within a tolerance time-window around a ground-truth onset, it is counted as a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). <br />
<br />
The algorithms based on detection functions will be tuned to a limited number of working points of the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before the submission to the contest.<br />
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the (100, 0) point. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
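A minimal sketch of this scoring, assuming a greedy first-match pairing and a 50 ms window (both details the proposal leaves open):<br />
<pre>
def score_file(detected, ground_truth, tol=0.05):
    """Count correct detections (CD), false positives (FP) and
    false negatives (FN) for one file, using a +/- tol window."""
    unmatched = sorted(ground_truth)
    cd = fp = 0
    for t in sorted(detected):
        match = next((g for g in unmatched if abs(g - t) <= tol), None)
        if match is None:
            fp += 1
        else:
            cd += 1
            unmatched.remove(match)  # each ground-truth onset is matched at most once
    return cd, fp, len(unmatched)

def distance_to_ideal(cd_rate, fp_rate):
    """Euclidean distance of the point (CD rate, FP rate) from the ideal (100, 0)."""
    return ((100.0 - cd_rate) ** 2 + fp_rate ** 2) ** 0.5
</pre>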
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall; see the formulas below)<br />
* time precision (tolerance from 50 ms downwards). For certain files, we cannot be much more accurate than 50 ms because of the limited annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
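For reference, the precision/recall form mentioned in the first bullet can be written as follows, with FN the number of missed ground-truth onsets (a standard restatement, not a formula given in the proposal):<br />
<math>P = \frac{CD}{CD + FP}, \qquad R = \frac{CD}{CD + FN}, \qquad F = \frac{2PR}{P + R}</math><br />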
<br />
==Relevant Test Collections==<br />
<br />
Audio data are recordings made by the MTG at UPF Barcelona and excerpts from the RWC database. MIDI data are excerpts from the RWC database. Annotations were conducted by the Centre for Digital Music at Queen Mary, University of London, the MTG at UPF Barcelona and the Musical Acoustics Lab at Paris 6 University. MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm) was used for this purpose. Annotators were provided with an approximate aim (catching all onsets corresponding to music notes, including pitched onsets and not only percussive ones), but no further supervision of annotation was performed.<br />
<br />
The defined ground-truth can be critical for the evaluation. For the MIDI-commanded instruments, care should be taken to synchronize the MIDI clock and the audio recording clock. For real-world sounds, precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). For these, the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are almost impossible to annotate precisely: legato bowed-string phrases, even more so with reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread, and the annotation precision must be taken into account in the evaluation. <br />
<br />
Article about annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner, a single number should finally be extracted. One possibility is tuning the algorithms to a single working point on the ROC curve, e.g. allowing a difference between the FP and FN rates of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
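For reference, McNemar's test on paired outcomes would use the counts b (onsets detected by algorithm A but missed by B) and c (the converse); this per-onset pairing is one reading of the suggestion, not something the proposal fixes:<br />
<math>\chi^2 = \frac{(|b - c| - 1)^2}{b + c}</math><br />
compared against a chi-square distribution with one degree of freedom.<br />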
<br />
The proposal does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.<br />
<br />
The problem is well defined; however, the author needs to take care when defining the task of onset detection for non-percussive events (e.g. a bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose; I would suggest that the author reduce the length of the proposal and make it more assertive.<br />
There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground-truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too high a variance in their annotation should be discarded from the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database could already be of great interest to the community.<br />
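A possible sketch of this selection step (the greedy clustering of annotator times and both thresholds are assumptions, since the review leaves the details open):<br />
<pre>
import statistics

def consensus_onsets(annotations, merge_tol=0.05, max_std=0.02):
    """annotations: list of onset-time lists, one per annotator.
    Groups times lying within merge_tol of each other, drops groups
    whose spread exceeds max_std, and returns (mean time, spread) pairs."""
    times = sorted(t for ann in annotations for t in ann)
    if not times:
        return []
    clusters, current = [], [times[0]]
    for t in times[1:]:
        if t - current[-1] <= merge_tol:
            current.append(t)
        else:
            clusters.append(current)
            current = [t]
    clusters.append(current)
    kept = []
    for c in clusters:
        spread = statistics.pstdev(c) if len(c) > 1 else 0.0
        if spread <= max_std:  # discard onsets the annotators disagree on
            kept.append((statistics.mean(c), spread))
    return kept
</pre>
The per-onset spread could then serve directly as the "tolerance window" the reviewer mentions.<br />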
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic, and for results to be of interest to the community the test data should contain real-life cases. I would also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context in which they appear: onsets in pitched percussive music (e.g. piano and guitar music), onsets in pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets in non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low-level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable subset of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3962005:Audio Onset Detect2005-03-24T13:19:57Z<p>138.37.33.58: /* Description */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data are monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* file length between 8 and 15 seconds<br />
<br />
''Audio content'':<br />
<br />
The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...). The performance of each algorithm will be assessed on the whole dataset but also on each class separately.<br />
<br />
The dataset contains 100 files from 5 classes annotated as follows:<br />
* 30 solo drum excerpts cross-annotated by 3 people<br />
* 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people<br />
* 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people<br />
* 15 complex mixes cross-annotated by 5 people<br />
* 15 complex mixes synthesized from MIDI<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
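On the evaluation side, such a file might be read back as follows (the one-value-per-line layout is an assumption, since only the file name is fixed by the proposal):<br />
<pre>
def read_onsets(path):
    """Read one onset time (in seconds) per line from an *_onsets.txt file."""
    with open(path) as f:
        return [float(line.split()[0]) for line in f if line.strip()]
</pre>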
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr>,<br />
Geoffroy Peeters <peeters@ircam.fr><br />
* Pompeu Fabra University, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science,Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
*Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. For one onset time detected, if it belongs to a tolerance time-window around it, it is considered as a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). <br />
<br />
The algorithms based on detection functions will be tuned to a limited number of working points of the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before the submission to the contest.<br />
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the (100, 0) point. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
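Written out, the criterion above is simply:<br />
<math>d = \sqrt{(100 - CD)^2 + FP^2}</math><br />
with CD and FP expressed as percentages and the smallest d ranked first.<br />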
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance from 50 ms downwards). For certain files, we cannot be much more accurate than 50 ms because of the limited annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of RWC Database, recordings in the labs (MIDI generated or human), upcoming FreeSound database, etc...<br />
Some of them have already been cross-annotated. It would be helpful if everyone owning an already annotated sound onset database detailed its contents (source of the annotation: MIDI, how many human subjects, etc.). That would give an overview of the number of onsets we already have, and of where they come from...<br />
<br />
Some training data is available at: http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm. It is composed of amateur recordings and RWC DB excerpts.<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner a single number should be finally extracted. A possibility to do so is tuning the algorithms to a single working point on the ROC curve, e.g. say allow a difference between FP and FN of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
It does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However I wonder if it is too-low level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying such an evaluation exercise.<br />
<br />
The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose, I will suggest that the author reduces the length of the proposal and makes it more assertive.<br />
There seems to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database, could be already of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I will also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc) seems a logical choice, so I agree with the author on that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context on which they appear: onsets on pitched percussive music (e.g. piano and guitar music), onsets on pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets on non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low-level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable subset of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3952005:Audio Onset Detect2005-03-24T13:15:25Z<p>138.37.33.58: /* Description */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data are monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* file length between 8 and 15 seconds<br />
<br />
''Audio content'':<br />
<br />
The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...). The performance of each algorithm will be assessed on the whole dataset but also on each class separately.<br />
<br />
The dataset contains 100 files from 5 classes annotated as follows:<br />
* 30 solo drum excerpts cross-annotated by 3 people<br />
* 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people<br />
* 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people<br />
* 15 complex mixes cross-annotated by 5 people<br />
* 15 complex mixes synthesized from MIDI<br />
<br />
Audio data are recordings made by the MTG at UPF Barcelona and excerpts from the RWC database. MIDI data are excerpts from the RWC database. Annotations were conducted by the Centre for Digital Music at Queen Mary, University of London, the MTG at UPF Barcelona and the Musical Acoustics Lab at Paris 6 University. MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm) was used for this purpose. Annotators were provided with an approximate aim (catching all onsets corresponding to music notes, including pitched onsets and not only percussive ones), but no further supervision of annotation was performed.<br />
<br />
''Notes on annotation'':<br />
<br />
The defined ground-truth can be critical for the evaluation.<br />
For the MIDI commanded instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.<br />
For real world sounds, precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). It also means that the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are quite impossible to annotate precisely: legato bowed strings phrases, even more difficult if you add reverb. Slightly broken chords also introduce ambiguities on the number of onsets to mark. In these cases the annotations can be spread, and the annotation precision must be taken into account in the evaluation. <br />
Article and matlab tool for annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr>,<br />
Geoffroy Peeters <peeters@ircam.fr><br />
* Pompeu Fabra University, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science,Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
*Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. For one onset time detected, if it belongs to a tolerance time-window around it, it is considered as a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). <br />
<br />
The algorithms based on detection functions will be tuned to a limited number of working points of the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before the submission to the contest.<br />
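Purely as an illustration of how such working points might be produced (the peak-picking rule and the example thresholds are assumptions; the tuning is left to each participant):<br />
<pre>
def pick_onsets(det_fn, frame_times, threshold):
    """Return the times of local maxima of a detection function above threshold."""
    onsets = []
    for i in range(1, len(det_fn) - 1):
        if det_fn[i] > threshold and det_fn[i] >= det_fn[i - 1] and det_fn[i] > det_fn[i + 1]:
            onsets.append(frame_times[i])
    return onsets

# Three working points from one algorithm, e.g. permissive, balanced, conservative:
# versions = {thr: pick_onsets(d, t, thr) for thr in (0.1, 0.3, 0.6)}
</pre>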
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the (100, 0) point. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance from 50 ms downwards). For certain files, we cannot be much more accurate than 50 ms because of the limited annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of RWC Database, recordings in the labs (MIDI generated or human), upcoming FreeSound database, etc...<br />
Some of them have already been cross-annotated. It would be helpful if everyone owning an already annotated sound onset database detailed its contents (source of the annotation: MIDI, how many human subjects, etc.). That would give an overview of the number of onsets we already have, and of where they come from...<br />
<br />
Some training data is available at: http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm. It is composed of amateur recordings and RWC DB excerpts.<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner a single number should be finally extracted. A possibility to do so is tuning the algorithms to a single working point on the ROC curve, e.g. say allow a difference between FP and FN of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
It does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However I wonder if it is too-low level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying such an evaluation exercise.<br />
<br />
The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose, I will suggest that the author reduces the length of the proposal and makes it more assertive.<br />
There seems to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database, could be already of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I will also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc) seems a logical choice, so I agree with the author on that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context on which they appear: onsets on pitched percussive music (e.g. piano and guitar music), onsets on pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets on non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low-level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable subset of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3942005:Audio Onset Detect2005-03-24T13:14:57Z<p>138.37.33.58: /* Description */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data are monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* file length between 8 and 15 seconds<br />
<br />
''Audio content'':<br />
<br />
The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...). The performance of each algorithm will be assessed on the whole dataset but also on each class separately.<br />
<br />
The dataset contains 100 files from 5 classes annotated as follows:<br />
- 30 solo drum excerpts cross-annotated by 3 people<br />
- 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people<br />
- 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people<br />
- 15 complex mixes cross-annotated by 5 people<br />
- 15 complex mixes synthesized from MIDI<br />
<br />
Audio data are recordings made by the MTG at UPF Barcelona and excerpts from the RWC database. MIDI data are excerpts from the RWC database. Annotations were conducted by the Centre for Digital Music at Queen Mary, University of London, the MTG at UPF Barcelona and the Musical Acoustics Lab at Paris 6 University. MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm) was used for this purpose. Annotators were provided with an approximate aim (catching all onsets corresponding to music notes, including pitched onsets and not only percussive ones), but no further supervision of annotation was performed.<br />
<br />
''Notes on annotation'':<br />
<br />
The defined ground-truth can be critical for the evaluation.<br />
For the MIDI commanded instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.<br />
For real world sounds, precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). It also means that the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are quite impossible to annotate precisely: legato bowed strings phrases, even more difficult if you add reverb. Slightly broken chords also introduce ambiguities on the number of onsets to mark. In these cases the annotations can be spread, and the annotation precision must be taken into account in the evaluation. <br />
Article and matlab tool for annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr>,<br />
Geoffroy Peeters <peeters@ircam.fr><br />
* Pompeu Fabra University, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science,Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
*Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. For one onset time detected, if it belongs to a tolerance time-window around it, it is considered as a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). <br />
<br />
The algorithms based on detection functions will be tuned to a limited number of working points of the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before the submission to the contest.<br />
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the (100, 0) point. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance from 50 ms downwards). For certain files, we cannot be much more accurate than 50 ms because of the limited annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of RWC Database, recordings in the labs (MIDI generated or human), upcoming FreeSound database, etc...<br />
Some of them have already been cross-annotated. It would be helpful if everyone owning an already annotated sound onset database detailed its contents (source of the annotation: MIDI, how many human subjects, etc.). That would give an overview of the number of onsets we already have, and of where they come from...<br />
<br />
Some training data is available at: http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm. It is composed of amateur recordings and RWC DB excerpts.<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner a single number should be finally extracted. A possibility to do so is tuning the algorithms to a single working point on the ROC curve, e.g. say allow a difference between FP and FN of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
It does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However I wonder if it is too-low level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying such an evaluation exercise.<br />
<br />
The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose, I will suggest that the author reduces the length of the proposal and makes it more assertive.<br />
There seems to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database, could be already of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I will also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc) seems a logical choice, so I agree with the author on that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context on which they appear: onsets on pitched percussive music (e.g. piano and guitar music), onsets on pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets on non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low-level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable subset of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3932005:Audio Onset Detect2005-03-24T12:16:30Z<p>138.37.33.58: /* Potential Participants */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data will be monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* file length between 8 and 15 seconds<br />
<br />
''Audio content'':<br />
<br />
The dataset will be subdivided into classes. This idea was raised by D. Ellis at the last MIREX. The reasons are:<br />
* onset detection is performed in various applications, some of which are dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...)<br />
* the composition of the entire database can determine the relative ranking of the onset detection algorithms. For example, an evaluation on a dataset composed principally of complex mixes will not do justice to an onset detector that performs well on solo phrases of bowed strings but slightly worse than the others on complex mixes.<br />
* it can show the weak points of the compared methods. I think it is more useful than an evaluation based on an overall success percentage or curve (a per-class scoring sketch is given after the class list below). <br />
The following classes will be considered:<br />
* monophonic instruments solo phrases<br />
* polyphonic instruments solo phrases<br />
* complex mixes<br />
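To make the per-class idea concrete (the unweighted macro average is my own illustration, not something the proposal prescribes), scores could be computed per class and then averaged so that the class proportions do not dominate the ranking:<br />
<pre>
def per_class_scores(results):
    """results: dict mapping class name to a list of (cd, fp, fn) tuples, one per file.
    Returns the correct-detection rate per class and their unweighted (macro) average."""
    rates = {}
    for cls, files in results.items():
        cd = sum(c for c, _, _ in files)
        fn = sum(n for _, _, n in files)
        rates[cls] = 100.0 * cd / (cd + fn) if (cd + fn) else 0.0
    return rates, sum(rates.values()) / len(rates)
</pre>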
<br />
<br />
''Meta data'':<br />
<br />
Two types of annotation can be provided:<br />
* Manual annotation for the real-world sounds. For this type of annotation, our article (linked below) discusses the potential difficulties.<br />
* MIDI scores for synthesized sounds or MIDI-commanded instruments. These are considered robust ground-truth.<br />
<br />
''Notes on annotation'':<br />
<br />
As mentioned above, the sound files will be provided with their onset time annotation. The ground-truth we will define can be critical for the evaluation.<br />
For the MIDI commanded instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.<br />
For real-world sounds, annotation volunteers are needed. The annotations should be cross-validated (errare humanum est). Precise instructions on which events to annotate must be given to the annotators.<br />
Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). For these, the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are almost impossible to annotate precisely: legato bowed-string phrases, even more so with reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread, and the annotation precision must be taken into account in the evaluation. How the annotation is taken into account must be precisely defined; my opinion is to discard sound events that are not music notes (for example breathing, key strokes, etc.), which are quite frequent in the solo recordings, even if they are detected by most of the onset detection algorithms...<br />
<br />
Article and matlab tool for annotation by Pierre Leveau et al.<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr>,<br />
Geoffroy Peeters <peeters@ircam.fr><br />
* Pompeu Fabra University, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science,Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
*Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. For one onset time detected, if it belongs to a tolerance time-window around it, it is considered as a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). <br />
<br />
The algorithms based on detection functions will be tuned to a limited number of working points of the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before the submission to the contest.<br />
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the (100, 0) point. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance from 50 ms downwards; a small sketch of a possible timing-error measure is given after these lists). For certain files, we cannot be much more accurate than 50 ms because of the limited annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
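As a sketch of how the time-precision figure could be reported (the mean-absolute-deviation choice and the matching rule are assumptions; the proposal only fixes the tolerance):<br />
<pre>
def timing_error(detected, ground_truth, tol=0.05):
    """Mean absolute deviation (seconds) between matched detections and
    their nearest ground-truth onsets, within a +/- tol window."""
    errors = [abs(min(ground_truth, key=lambda g: abs(g - t)) - t) for t in detected]
    errors = [e for e in errors if e <= tol]
    return sum(errors) / len(errors) if errors else None
</pre>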
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of RWC Database, recordings in the labs (MIDI generated or human), upcoming FreeSound database, etc...<br />
Some of them have already been cross-annotated. It would be helpful if everyone owning an already annotated sound onset database detailed its contents (source of the annotation: MIDI, how many human subjects, etc.). That would give an overview of the number of onsets we already have, and of where they come from...<br />
<br />
Some training data is available at: http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm. It is composed of amateur recordings and RWC DB excerpts.<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner a single number should be finally extracted. A possibility to do so is tuning the algorithms to a single working point on the ROC curve, e.g. say allow a difference between FP and FN of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
It does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However I wonder if it is too-low level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying such an evaluation exercise.<br />
<br />
The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose; I would suggest that the author reduce the length of the proposal and make it more assertive.<br />
There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present high variance in their annotations. These observations on the annotated database could already be of great interest to the community.<br />
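As a rough illustration only, a per-onset tolerance window could be derived from cross-annotations along these lines (the 50 ms rejection threshold and the 10 ms floor are assumed values, not part of the proposal):<br />
<pre>
# Rough sketch: derive a per-onset tolerance window from several annotators'
# times for the same physical onset, discarding onsets with too much spread.
# The 50 ms rejection threshold and 10 ms floor are assumed values.
import statistics

def consolidate(annotations, max_spread=0.050):
    spread = max(annotations) - min(annotations)
    if spread > max_spread:
        return None  # too ambiguous: exclude this onset from the evaluation
    center = statistics.mean(annotations)
    tolerance = max(spread / 2.0, 0.010)
    return center, tolerance

print(consolidate([1.012, 1.018, 1.015]))  # kept, with a small tolerance
print(consolidate([2.00, 2.09, 2.04]))     # discarded (spread above 50 ms)
</pre>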
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic, and for results to be of interest to the community the test data should contain real-life cases. I would also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context in which they appear: onsets in pitched percussive music (e.g. piano and guitar music), onsets in pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets in non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable sub-set of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3922005:Audio Onset Detect2005-03-24T12:15:33Z<p>138.37.33.58: /* Description */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data will be monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* file length between 8 and 15 seconds<br />
<br />
''Audio content'':<br />
<br />
The dataset will be subdivided into classes. This idea was raised by D. Ellis at the last MIREX. The reasons are:<br />
* onset detection is performed in various applications, some of which are dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...)<br />
* the composition of the entire database can determine the relative ranking of the onset detection algorithms. For example, an evaluation on a dataset principally composed of complex mixes will not favour an onset detector that performs well on solo phrases of bowed strings but slightly worse than the others on complex mixes.<br />
* it can show the weak points of the compared methods. I think this is more useful than an evaluation based on an overall success percentage or curve. <br />
The following classes will be considered:<br />
* monophonic instruments solo phrases<br />
* polyphonic instruments solo phrases<br />
* complex mixes<br />
<br />
<br />
''Meta data'':<br />
<br />
Two types of annotation can be provided:<br />
* Manual annotation for the real-world sounds. For this type of annotation, our article (referenced below) discusses the potential difficulties.<br />
* MIDI scores for synthesized sounds or MIDI-controlled instruments. These are considered robust ground truth.<br />
<br />
''Notes on annotation'':<br />
<br />
As mentioned above, the sound files will be provided with their onset time annotation. The ground-truth we will define can be critical for the evaluation.<br />
For the MIDI-controlled instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.<br />
For real-world sounds, annotation volunteers are needed. The annotations should be cross-validated (errare humanum est), and precise instructions on which events to annotate must be given to the annotators.<br />
Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). For these, the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are almost impossible to annotate precisely: legato bowed-string phrases, even more so with reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread out, and the annotation precision must be taken into account in the evaluation. How the annotation is taken into account must be precisely defined; my opinion is to discard sound events that are not musical notes (for example breathing, key strokes, etc., which are quite frequent in solo recordings), even if they are detected by most onset detection algorithms.<br />
<br />
Article and Matlab annotation tool by Pierre Leveau et al.:<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
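A minimal sketch of producing such an output file (the directory handling, the decision to drop the ".wav" extension from the output name, and the onset values are assumptions for illustration):<br />
<pre>
# Sketch: write detected onset times (seconds) to <AudioFileName>_onsets.txt
# inside the results directory, one onset per line. Whether the ".wav"
# extension is kept in the output name is an assumption (dropped here).
import os

def write_onsets(results_dir, audio_filename, onset_times):
    base = os.path.splitext(os.path.basename(audio_filename))[0]
    os.makedirs(results_dir, exist_ok=True)
    out_path = os.path.join(results_dir, base + "_onsets.txt")
    with open(out_path, "w") as f:
        for t in sorted(onset_times):
            f.write("%.6f\n" % t)
    return out_path

print(write_onsets("results/my_algo", "excerpt01.wav", [0.125, 0.612, 1.004]))
</pre>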
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr><br />
* University of Pompeu Fabra, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science,Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
*Centre for Music and Science, Cambridge<br />
Nick Collins <nc272 at cam dot ac dot uk><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. For each detected onset time, if it falls within a tolerance time-window around a ground-truth onset, it is considered a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). <br />
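As an illustration only, one possible matching procedure (greedy one-to-one matching within a fixed tolerance window, here an assumed 50 ms) could look like this; note that a doubled detection simply shows up as a false positive under this matching:<br />
<pre>
# Sketch: greedy one-to-one matching of detected onsets to ground-truth onsets
# within a fixed tolerance window (an assumed 50 ms here). Doubled detections
# end up counted as false positives.
def match_onsets(detected, ground_truth, tol=0.050):
    truth = sorted(ground_truth)
    used = [False] * len(truth)
    cd = fp = 0
    for d in sorted(detected):
        hit = None
        for i, t in enumerate(truth):
            if not used[i] and abs(d - t) <= tol:
                hit = i
                break
        if hit is None:
            fp += 1
        else:
            used[hit] = True
            cd += 1
    missed = used.count(False)
    return cd, fp, missed

print(match_onsets([0.10, 0.52, 0.53, 1.20], [0.11, 0.50, 1.90]))  # -> (2, 2, 1)
</pre>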
<br />
The algorithms based on detection functions will be tuned to a limited number of working points on the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before submission to the contest.<br />
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the ideal (100, 0) point. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall; a small computation sketch follows this list)<br />
* time precision (tolerance of 50 ms or less). For certain files, we cannot be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
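A small sketch of turning CD / FP / missed counts into precision, recall and F-measure (the counts below are invented for illustration):<br />
<pre>
# Sketch: express CD / FP / missed counts as precision, recall and F-measure.
# The counts below are invented.
def prf(cd, fp, missed):
    precision = cd / float(cd + fp) if cd + fp else 0.0
    recall = cd / float(cd + missed) if cd + missed else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(prf(cd=180, fp=20, missed=30))
</pre>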
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of RWC Database, recordings in the labs (MIDI generated or human), upcoming FreeSound database, etc...<br />
Some of them have already been cross-annotated. It would be helpful if each group that already owns an annotated sound onset database detailed its contents (source of the annotation: MIDI, number of human subjects, etc.). This would give an overview of how many onsets we already have and where they come from.<br />
<br />
Some training data is available at: http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm. It is composed of amateur recordings and RWC DB excerpts.<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting if the proposal briefly discussed whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner, a single number should finally be extracted. One possibility is to tune the algorithms to a single working point on the ROC curve, e.g. allowing a difference between the FP and FN rates of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
The proposal does not mention whether training data will be available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.<br />
<br />
The problem is well defined; however, the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onsets from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose; I would suggest that the author reduce the length of the proposal and make it more assertive.<br />
There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present high variance in their annotations. These observations on the annotated database could already be of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic, and for results to be of interest to the community the test data should contain real-life cases. I would also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context in which they appear: onsets in pitched percussive music (e.g. piano and guitar music), onsets in pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets in non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable sub-set of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3902005:Audio Onset Detect2005-03-15T15:06:09Z<p>138.37.33.58: /* Potential Participants */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) '''Input data'''<br />
<br />
''Audio format'':<br />
<br />
The data will be monophonic sound files, with the associated onset times and<br />
data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* the file length is not critical for this task, but excerpts of 30 seconds maximum would be convenient if we want a good diversity in the dataset. It must be remembered that real-world sounds must be manually annotated (a painful and time-consuming task, as pointed out by J. Bello at MIREX 2004).<br />
<br />
''Audio content'':<br />
<br />
The dataset will be subdivided into classes. This idea was raised by D. Ellis at the last MIREX. The reasons are:<br />
* onset detection is performed in various applications, some of which are dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...)<br />
* the composition of the entire database can determine the relative ranking of the onset detection algorithms. For example, an evaluation on a dataset principally composed of complex mixes will not favour an onset detector that performs well on solo phrases of bowed strings but slightly worse than the others on complex mixes.<br />
* it can show the weak points of the compared methods. I think this is more useful than an evaluation based on an overall success percentage or curve. <br />
The 3 following classes will be considered:<br />
* monophonic instruments solo phrases<br />
* polyphonic instruments solo phrases<br />
* complex mixes<br />
<br />
<br />
''Meta data'':<br />
<br />
Two types of annotation can be provided:<br />
* Manual annotation for the real-world sounds. For this type of annotation, our article (referenced below) discusses the potential difficulties.<br />
* MIDI scores for synthesized sounds or MIDI-controlled instruments. These are considered robust ground truth.<br />
<br />
''Notes on annotation'':<br />
<br />
As mentioned above, the sound files will be provided with their onset time annotation. The ground-truth we will define can be critical for the evaluation.<br />
For the MIDI-controlled instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.<br />
For real-world sounds, annotation volunteers are needed. The annotations should be cross-validated (errare humanum est), and precise instructions on which events to annotate must be given to the annotators.<br />
Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). For these, the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are almost impossible to annotate precisely: legato bowed-string phrases, even more so with reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread out, and the annotation precision must be taken into account in the evaluation. How the annotation is taken into account must be precisely defined; my opinion is to discard sound events that are not musical notes (for example breathing, key strokes, etc., which are quite frequent in solo recordings), even if they are detected by most onset detection algorithms.<br />
<br />
Article and Matlab annotation tool by Pierre Leveau et al.:<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm<br />
<br />
2) '''Output data'''<br />
<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr>,<br />
Axel Roebel <roebel@ircam.fr><br />
* University of Pompeu Fabra, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
* Indian Institute of Science,Bangalore<br />
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in><br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. For each detected onset time, if it falls within a tolerance time-window around a ground-truth onset, it is considered a '''correct detection''' (CD). If not, it is a '''false positive''' (FP). <br />
<br />
The algorithms based on detection functions will be tuned to a limited number of working points on the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between. These tunings will be considered as different versions of the same algorithm, and will be done before submission to the contest.<br />
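Purely as an illustration of what "freezing" a few working points could mean in practice (the detection function values, the peak-picking rule and the thresholds below are all invented):<br />
<pre>
# Sketch: freeze a detection-function-based algorithm at a few working points
# by fixing different peak-picking thresholds. The detection function values,
# the peak-picking rule and the thresholds are all invented for illustration.
def pick_onsets(detection_fn, times, threshold):
    onsets = []
    for i in range(1, len(detection_fn) - 1):
        v = detection_fn[i]
        if v > threshold and v >= detection_fn[i - 1] and v > detection_fn[i + 1]:
            onsets.append(times[i])
    return onsets

df = [0.1, 0.8, 0.2, 0.1, 0.5, 0.9, 0.3, 0.05]
times = [i * 0.01 for i in range(len(df))]
for thr in (0.3, 0.6, 0.85):  # e.g. high-recall, balanced and high-precision versions
    print("threshold", thr, "->", pick_onsets(df, times, thr))
</pre>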
<br />
To establish a ranking (and indicate a winner...), we can compute the Euclidean distance between the (CD rate, FP rate) point and the ideal (100, 0) point. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of each type of error (false positive or false negative) depends on the application following this task.<br />
<br />
<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance of 50 ms or less). For certain files, we cannot be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of RWC Database, recordings in the labs (MIDI generated or human), upcoming FreeSound database, etc...<br />
Some of them have already been cross-annotated. It would be helpful if each group that already owns an annotated sound onset database detailed its contents (source of the annotation: MIDI, number of human subjects, etc.). This would give an overview of how many onsets we already have and where they come from.<br />
<br />
Some training data is available at: http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm. It is composed of amateur recordings and RWC DB excerpts.<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting if the proposal briefly discussed whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner, a single number should finally be extracted. One possibility is to tune the algorithms to a single working point on the ROC curve, e.g. allowing a difference between the FP and FN rates of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
<br />
The proposal does not mention whether training data will be available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.<br />
<br />
The problem is well defined; however, the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onsets from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose; I would suggest that the author reduce the length of the proposal and make it more assertive.<br />
There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present high variance in their annotations. These observations on the annotated database could already be of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic, and for results to be of interest to the community the test data should contain real-life cases. I would also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context in which they appear: onsets in pitched percussive music (e.g. piano and guitar music), onsets in pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets in non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.<br />
<br />
==Downie's Comments==<br />
<br />
1. Tend to agree that this is a rather low level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.<br />
<br />
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable sub-set of this that we might run as we prepare for future MIREXes?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Drum_Det&diff=2912005:Audio Drum Det2005-03-10T10:34:58Z<p>138.37.33.58: /* Potential Participants */</p>
<hr />
<div>==Proposer==<br />
<br />
Koen Tanghe (Ghent University) koen [dot] tanghe [at] ugent [dot] be<br />
<br />
==Title==<br />
<br />
Drum detection from polyphonic audio.<br />
<br />
<br />
==Description==<br />
<br />
The task consists of determining the positions (localization) and corresponding drum class names (labeling) of drum events in polyphonic music. This is very interesting rhythmic information for today's popular music genres: it can help in determining tempo and (sub)genre, and it can also be queried for directly (typical rhythmic sequences/patterns).<br />
<br />
1) Input data<br />
The only input for this task is a set of sound file excerpts adhering to the format and content requirements mentioned below.<br />
<br />
Audio format:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* mono and stereo<br />
* 30 seconds excerpts<br />
* files are named as "001.wav" to "999.wav" (or with another extension depending on the chosen format)<br />
<br />
Audio content:<br />
* polyphonic music with drums (most)<br />
* polyphonic music without drums (some)<br />
* different genres / playing styles<br />
* both live performances and sequenced music<br />
* different types of drum sets (acoustic, electronic, ...)<br />
* at least 50 files<br />
* participants receive at least 10 files in advance<br />
<br />
[Perfe 02/25/05: I would vote for mono and more than 50 files; the 10 files to be given to participants should be randomly drawn from the total pool of N available annotated files unless there is some bias in the collection that is related to genre or other important class; I mean, if there are 30% of electronic percussion files, then 3 out of 10 files should contain that. In that case, a stratified sampling should be used]<br />
<br />
[Masataka 03/07/2005: I agree with the above Perfe's comments. In addition, our team prefers to use the whole songs (not excerpts). I would like to make sure that the input audio signals contain sounds of<br />
various musical instruments (some of them include vocals, too), and that the actual drum sounds (sound samples) included in the input mixture are not known in advance because we have to deal with those situations in practical applications.]<br />
<br />
2) Output results<br />
The output of this task is, for each sound file, an ASCII text file containing 2 columns, where each line represents a drum event. The first column is the position (in seconds) of the drum event, and the second column is the label for the drum event at that position. Multiple drum events may occur at the same time, so there may be multiple lines having the same value in the first column. The file names of the output files are the same as the audio files, but the extension is ".txt" (so: "001.txt" for "001.wav").<br />
<br />
Classes and labels that are considered:<br />
* BD (bass drum)<br />
* SD (snare drum)<br />
* HH (hihat)<br />
* CY (cymbal)<br />
* TM (tom)<br />
<br />
[Perfe 02/25/05: What about adding an "other" class? How are we going to manage the combination of sounds?]<br />
<br />
[Masataka 03/07/2005: How about adding the option of evaluating only BD, SD, and HH?]<br />
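A minimal sketch of the expected output layout (the tab separator and the example events are assumptions for illustration; the proposal only fixes the two-column time/label content and the file naming):<br />
<pre>
# Sketch: write drum events as "time_in_seconds<TAB>label" lines, one event per
# line, to 001.txt for 001.wav. The tab separator and the example events are
# assumptions for illustration; only the two-column content is specified.
def write_drum_events(audio_name, events):
    out_name = audio_name.rsplit(".", 1)[0] + ".txt"
    with open(out_name, "w") as f:
        for time_sec, label in sorted(events):
            f.write("%.3f\t%s\n" % (time_sec, label))
    return out_name

print(write_drum_events("001.wav", [(0.500, "BD"), (0.500, "HH"), (1.000, "SD")]))
</pre>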
<br />
==Potential Participants==<br />
<br />
* Vegard Sandvold (Notam), Fabien Gouyon (MTG, University of Pompeu Fabra), Perfecto Herrera (UPF) <br />
vegardsa[at]student[dot]matnat[dot]uio[dot]no, fabien[dot]gouyon[at]iua[dot]upf[dot]es, perfe[at]iua[dot]upf[dot]es, likely<br />
* Koen Tanghe (IPEM, Ghent University)<br />
Koen[dot]Tanghe[at]UGent[dot]be, highly likely<br />
* Christian Uhle (Fraunhofer)<br />
uhle[at]idmt[dot]fraunhofer[dot]de, ???<br />
* Anssi Klapuri (Tampere University of Technology)<br />
klap[at]cs[dot]tut[dot]fi, not participating (Jouni Paulus represents our group)<br />
* Kazuyoshi Yoshii (Kyoto University), Masataka Goto (AIST), Hiroshi G. Okuno (Kyoto University)<br />
yoshii[at]kuis[dot]kyoto-u[dot]ac[dot]jp, m.goto[at]aist[dot]go[dot]jp, okuno[at]i[dot]kyoto-u[dot]ac[dot]jp, highly likely<br />
* Derry FitzGerald (Cork Institute of Technology)<br />
derry[dot]fitzgerald[at]cit[dot]ie, likely<br />
* Gaël Richard and Olivier Gillet (Telecom Paris)<br />
gael[dot]richard[at]enst[dot]fr, olivier[dot]gilllet[at]enst[dot]fr very likely<br />
* Jouni Paulus (Tampere University of Technology)<br />
jouni[dot]paulus[at]tut[dot]fi, moderately likely<br />
* George Tzanetakis (University of Victoria, Canada)<br />
gtzan[at]cs[dot]uvic[dot]ca, moderately likely<br />
<br />
==Evaluation Procedures==<br />
<br />
Comparison rules:<br />
Questions to be answered:<br />
* when do we consider a detected event as "correct"?<br />
* when do we consider a detected event as "false"?<br />
* when do we consider a ground truth event as "missed"?<br />
* what's the maximum difference in time between real drum event position and detected drum event position that can be allowed?<br />
* is detecting an event at a valid ground truth position but classifying it incorrectly as bad as not detecting the event at all?<br />
<br />
Evaluation measures:<br />
Which performance measure are we going to use: precision, recall, accuracy, F-measure, ...?<br />
<br />
Drum detection may have several goals, and thus the evaluation should assess the algorithms relative to the initial goal or application. In our case, I believe the interest is "obtaining metadata that describe the drum track of a file". In this context, the ideal would be a kind of perceptual distance in the metadata domain, but do such distances exist? Is it possible to define one without conducting lengthy perceptual experiments?<br />
One possibility would be to use a distance similar to the one we used for our drum loop query system (soon to be published in the special issue of JIIS). The basic idea is to compute an edit distance between the obtained metadata and the ground-truth metadata strings. The edit distance accounts for deletions, insertions and confusions, but also takes into account desynchronization between events and allows cost coefficients to be associated with confusions (for example, it is often less serious to miss a hi-hat (charley) hit than a bass drum hit).<br />
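A rough sketch of such an edit distance over label sequences is given below (all cost values are invented, and unlike the published distance it does not model desynchronization between events):<br />
<pre>
# Sketch: dynamic-programming edit distance between a transcribed and a
# ground-truth drum-label sequence, with class-dependent confusion costs.
# All cost values are invented; the published distance also handles
# desynchronization between events, which is not modelled here.
DEL = INS = 1.0
CONFUSION = {("BD", "SD"): 1.0, ("SD", "BD"): 1.0,
             ("HH", "CY"): 0.3, ("CY", "HH"): 0.3}  # hihat/cymbal mix-ups are cheap

def sub_cost(a, b):
    return 0.0 if a == b else CONFUSION.get((a, b), 0.8)

def edit_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * DEL
    for j in range(1, m + 1):
        d[0][j] = j * INS
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + DEL,
                          d[i][j - 1] + INS,
                          d[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]))
    return d[n][m]

print(edit_distance(["BD", "HH", "SD", "HH"], ["BD", "CY", "SD"]))  # 0.3 + 1.0
</pre>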
<br />
<br />
==Relevant Test Collections==<br />
<br />
Ground truth annotations:<br />
<br />
For each sound file to be analyzed, there is a corresponding annotation file using the same format as described in "3. Output". The ground truth files are obtained by manual annotation by people who have experience with drum sounds (drummers?).<br />
RWC and Magnatune are potentially excellent sources. For annotation, it would be important to include an annotation cross-check (several drummers, 3 being an ideal minimum, annotating the same files). This would be quite similar to the methodology that we followed for the onset detection evaluation (see P. Leveau's paper at the last ISMIR). This would make it possible to obtain an excellent ground-truth annotation, and would also permit evaluating which kinds of confusions are never made and which ones are often made, what the acceptable maximum difference in time between real and detected drum events is, etc. However, as always, this requires more effort and time.<br />
Another option (for sequenced music) is to use *audio recordings* of MIDI sequences, and use the drum tracks of the MIDI files to obtain the ground truth annotations.<br />
<br />
[Perfe 02/25/05: In case of using MIDI files, I'd suggest adding some "human touch" MIDI post-processing, plus some basic audio production tricks such as compression and reverb, in order to make the audio as close as possible to the complexity of real recordings; I would not use more than 30% MIDI files, if needed]<br />
<br />
[Masataka 03/07/2005: For annotation, I've been working on labeling all<br />
the onset times of BD, SD, and HH on more than 50 songs in the RWC Music Database (RWC-MDB-P-2001). I have a plan to put them on http://staff.aist.go.jp/m.goto/RWC-MDB/ so that they can be available for RWC-MDB users.]<br />
<br />
==Review 1==<br />
<br />
Problem is both clearly defined and interesting in terms of current research.<br />
<br />
Audio format and content are fine, however, it would be nice to include more than 50 files, although this would probably make the transcription task too difficult/time consuming. Either Mono or Stereo recordings should be chosen, I suggest polling participants to see if anyone intends to use stereo information or whether all participants will down-mix to mono. There is no mention of transcribed datasets so this will have to be done from scratch and therefore the proposed use of the RWC or Magnatune databases is a good idea. I am unsure whether the use of synthesized midi files is valid unless they are produced using samples rather than synthesized drum sounds and even then you would need to use several different samples of each sound to ensure enough variance for a proper evaluation. I agree that ground truth annotations should be produced by 2-3 non-participating transcribers.<br />
<br />
The output result format is fine; however, there may be more classes of drum/percussive sounds that should be considered, such as maracas or a tambourine. Obviously this will depend on the content of the audio files used, and these could form an abstract grouping if there are insufficient training examples for separate groups.<br />
<br />
The evaluation procedures contain more questions than answers. Obviously this task is quite dependent on the onset detection/segmentation. Paul Brossier proposed for the onset detection evaluation that events detected within 50 ms of the transcribed position should be considered correct; I assume this holds for the drum detection proposal. I think it would be interesting and not too taxing to have two tracks: one supplying the ground-truth segmentation and requiring only the classification of detected events, and another performing the whole task.<br />
<br />
Will submissions be run once or cross-validated? As there is going to be a very small dataset a high number of folds should be used, although this should be limited so that every fold contains at least one example of each class.<br />
<br />
F-measure (mean and variance for cross-validated results) would seem to be the most applicable evaluation metric if the whole task is performed. Will precision and recall be given equal weighting in the F-measure? See Speech & Language Processing, Jurafsky and Martin, 2000, p. 578, for the generalization of the F-measure: F = (b^2+1)PR / (b^2*P + R). When b=1, P and R have equal weight; b>1 gives more weight to P, b<1 to R. A simple accuracy result would be fine if segmentation is supplied. Statistical significance of differences between algorithms should be estimated, and it would be interesting to see the statistical significance of differences between using ground-truth segmentation and the detected segmentation, thereby allowing us to assess whether the segmentation or the event classification were at fault.<br />
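For concreteness, the generalized F-measure mentioned above could be computed as follows (the input values are illustrative only):<br />
<pre>
# Sketch of the generalized F-measure F = (b^2 + 1) P R / (b^2 P + R);
# with b = 1, precision and recall are weighted equally. Inputs are illustrative.
def f_measure(precision, recall, b=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (b * b + 1.0) * precision * recall / (b * b * precision + recall)

print(f_measure(0.80, 0.70))         # b = 1
print(f_measure(0.80, 0.70, b=2.0))  # weights shifted, per the cited formulation
</pre>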
<br />
Finally, given the list of potential participants and their publications, I think we can be confident of sufficient participation to run the evaluation.<br />
<br />
Recommendation: Refine proposal and accept<br />
<br />
[Masataka 03/07/2005: I would also vote for the use of F-measure that is the harmonic mean of the recall rate and the precision rate. I think the above-mentioned 50ms threshold for onset-deviation errors is too large for drum sounds: how about using 25ms, for example? We found it sufficient and appropriate when evaluating our method in our ISMIR 2004 paper: http://staff.aist.go.jp/m.goto/PAPER/ISMIR2004yoshii.pdf]<br />
<br />
==Review 2==<br />
<br />
The problem is well described and its applications are of great concern to the MIR community. However, the evaluation procedures and test data contain more questions than answers. The proposal should be much more affirmative. Precise evaluation metrics need to be defined (so that every participant can implement them in a reproducible way), and the choice of the test data has to be discussed (is it relevant to test algorithms on MIDI data only if different synthesizers are used, or is it necessary to use audio data?). This proposal is not mature enough now and the participants should make some effort to improve it.<br />
<br />
Another issue is that the problem is not MIR in itself, but rather mid-level sound description. If the main applications are tempo induction and subgenre classification, why not evaluate the performance for these applications directly? This would be more relevant for MIR and annotation would be far less time-consuming. I think this issue has to be seriously considered by the participants in case they do not already own a sufficient amount of annotated data.<br />
<br />
[Perfe 02/25/02: I do not agree that it is not MIR. It is MIR and it is high-level description. The main application is knowing if the song has drums, if there are lots of drums or only some spare hits, if there are lots of cymbals or not (hence, some genres could be discarded). The direct application, on the other hand, is still a bit far, as there are some perceptual issues involved, and perceptual issues require some time to be sorted out]<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued by the idea that MIDI or some other symbolic representation could be used to bring together the generation and ground truth tasks. Where does quantization fit into this (i.e., it is hard to "swing" midi files)?<br />
<br />
2. If MIDI files are used for generation/ground truth, would it be necessary to introduce background music to make the task more difficult? I suppose the MIDI file could generate the background music also... I wonder if there are some other tricks we might be missing.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_and_Symbolic_Key&diff=7962005:Audio and Symbolic Key2005-02-28T21:00:50Z<p>138.37.33.58: /* Relevant Test Collections */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key-finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this as a first step in the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es) [high]<br />
* Steffen Pauws (steffen.pauws@philips.com) [high]<br />
* Ozgur Izmirli (oizm@conncoll.edu) [moderate]<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg) [low]<br />
<br />
'''Symbolic Key-Finding''':<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi) [high]<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* David Temperley (dtemp@theory.esm.rochester.edu) and Daniel Sleator (sleator@cs.cmu.edu) [high]<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi) [high]<br />
* Craig Sapp (craig@ccrma.stanford.edu) [moderate]<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms.<br />
<br />
'''Test Set''': The test set we propose to use will consist of 30 second excerpts of pieces for which the keys are known. For example, symphonies and concertos by well-known composers often have the keys stated in the title of the piece. The excerpts will typically be the beginnings of the pieces as this is one part of the piece for which establishing of the global and known key can be guaranteed.<br />
<br />
'''Input/Output''': The input to the system should be some musical excerpt (either audio or MIDI) and the output should be a key name, for example, C major or E flat minor.<br />
<br />
'''System Calibration''': It is reasonable to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. The participants will be provided with training data that they may use in determining the optimal settings. The training data will be randomly selected and a representative subset of the actual test data that will be used in the evaluation process.<br />
<br />
'''Submission ''': The participants will provide the evaluation committee a copy of their system as well as the output from their algorithm for the training data. This will serve as a way to test how the algorithm performs in the evaluation environment.<br />
<br />
'''Evaluation ''': The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. The key of the piece is the one defined by the composer in the title of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. A correct key assignment will be given a full point, and incorrect assignments will be allocated fractions of a point according to the following table:<br />
<br />
{|<br />
|Relation to correct key ||Points<br />
|-<br />
|Same||1<br />
|-<br />
|Perfect fifth||0.5<br />
|-<br />
|Relative major/minor||0.3<br />
|-<br />
|Parallel major/minor||0.2<br />
|}<br />
<br />
The team with the highest total score will be the designated winner. Further details of how each system performed within each genre will also be provided.<br />
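As an illustration only, the scoring table could be applied as follows (the key encoding and the relation tests are assumptions made to illustrate the table; they are not part of the proposal):<br />
<pre>
# Sketch of the proposed scoring, with keys encoded as (tonic, mode) pairs such
# as ("C", "major"). The encoding and relation tests are assumptions made only
# to illustrate the table; they are not part of the proposal.
PC = {"C": 0, "C#": 1, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5, "F#": 6,
      "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}

def key_score(estimated, reference):
    (et, em), (rt, rm) = estimated, reference
    et, rt = PC[et], PC[rt]
    if (et, em) == (rt, rm):
        return 1.0                                   # same key
    if em == rm and (et - rt) % 12 in (5, 7):
        return 0.5                                   # perfect fifth away
    if em != rm:
        if rm == "major" and et == (rt + 9) % 12:
            return 0.3                               # relative minor
        if rm == "minor" and et == (rt + 3) % 12:
            return 0.3                               # relative major
        if et == rt:
            return 0.2                               # parallel major/minor
    return 0.0

print(key_score(("A", "minor"), ("C", "major")))  # relative -> 0.3
print(key_score(("G", "major"), ("C", "major")))  # fifth    -> 0.5
</pre>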
<br />
'''Comments''': Many excellent suggestions were made in the review process. Some of the ideas included: using actual audio files from recordings for the audio portion of the contest, employing other metrics used in information retrieval literature, using test data from a wider variety of genres, and considering the detection of key modulations. <br />
<br />
As this is a first attempt at evaluating key-finding across different systems employing a variety of algorithm combinations, we have opted to keep the evaluation procedure as simple and streamlined as possible. The results of this contest will lay the groundwork from which we can expand the techniques for key-finding evaluation.<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Symbolic Data''':<br />
MIDI Collections: MIDI data are an event-based representation of music, providing a numeric representation of the pitch, onset/offset time and velocity for every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files in MIDI format, by more than two thousand composers. All files are presented with the full title and composer, and most files state the key clearly. Music by different composers may be used to test the range of the algorithm. <br />
<br />
'''Audio Data''':<br />
Synthesized MIDI: Audio data can be generated by synthesizing the MIDI data proposed above. By using the same data for both the symbolic and audio key-finding methods, we will be able to evaluate and compare both approaches. It should be noted that even though synthesized MIDI is a simple alternative to actual audio, it is an appropriate approach for an evaluation where we consider both audio and symbolic algorithms. Also, this controlled method eliminates possible tuning issues that are sometimes present in recorded audio.<br />
<br />
Audio-from-MIDI data can be synthesized using either software or hardware. The software synthesizers include freeware such as Winamp and commercial software such as Cakewalk. The hardware synthesizers, for instance, a Roland XV5080, can receive MIDI commands and use built-in synthesizers to produce more realistic sound.<br />
<br />
'''Test Data''':<br />
The test data can be obtained from the Classical Archives website (http://www.classicalarchives.com). This site provides a large collection of classical music. Examples of pieces with labeled keys appropriate for the test data set include, but are not limited to, the following:<br />
<br />
Pieces from the Baroque period:<br />
Bach (http://www.classicalarchives.com/bach.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Vivaldi (http://www.classicalarchives.com/vivaldi.html) – Concerti and Chamber Works.<br />
<br />
Pieces from the Classical period:<br />
Handel (http://www.classicalarchives.com/handel.html) – Orchestral Works, Keyboard Works, and Chamber Works.<br />
Haydn (http://www.classicalarchives.com/haydn.html) – Keyboard Works, Chamber Works, and Orchestral Works.<br />
Mozart (http://www.classicalarchives.com/mozart.html) – Keyboard Works, Symphonies and Concertos, and Chamber Works.<br />
Early Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
<br />
Pieces from the Romantic period:<br />
Late Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.<br />
Brahms (http://www.classicalarchives.com/brahms.html) – Keyboard Works, Chamber Works, Concertos and Orchestral Works.<br />
Chopin (http://www.classicalarchives.com/chopin.html) – Piano Works.<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another for audio data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared by having a test collection including audio data and its MIDI representation, or MIDI representations and the audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with both MIDI and audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
[Hendrik 02.26.05]: Key finding only makes sense for music of major/minor tonality. Some music is very clear in its tonal reference, e.g., Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and modern jazz. Other music has tonal centers but no major/minor tonality, e.g. raga or gamelan.<br />
So it could be useful to specify the realm of the challenge, the composers, epochs, or genres, e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 hits 1950-2005, and New Orleans to bebop.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". <br />
<br />
[Chinghua 02.10.05]: Those relationships can be considered keys close to the main key, but they are still not the main key. If the algorithm gives those answers, it still achieves some points. So I suggest we give multiple levels of scores to the different answers. For example, the main key gets the full score (maybe 5), the perfect fifth gets 75% or 80% of the full score (maybe 3), and so on. <br />
<br />
<br />
What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate how the key is tuned (A = 440 or 442 Hz, ...). Keys should also be considered 'close' if they are related by one semitone, to account for this difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
[Chinghua 02.10.05]: Since we will use a MIDI synthesizer to generate the audio, tuning will not be a serious problem. The detection algorithm should have the ability to regard both 440 and 442 Hz as pitch A. If the original piece is written in A major but the MIDI arrangement shifted it a half step down to Ab major, then the algorithm (both the MIDI and the audio part) should detect Ab major instead of A major. <br />
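A minimal sketch of staying robust to an A = 440 vs. A = 442 Hz reference when mapping detected frequencies to pitch classes (the values are illustrative, and estimating the reference itself is a separate problem not shown here):<br />
<pre>
# Sketch: map a detected frequency to a pitch class while allowing the A4
# reference to deviate from 440 Hz (e.g. 442 Hz). Values are illustrative;
# estimating the reference itself is a separate problem.
import math

def pitch_class(freq_hz, ref_a4=440.0):
    semitones_from_a4 = 12.0 * math.log(freq_hz / ref_a4, 2)
    return int(round(semitones_from_a4)) % 12  # 0 = A, 3 = C, ...

print(pitch_class(263.0, ref_a4=440.0))  # roughly C4 under A = 440 -> 3
print(pitch_class(264.2, ref_a4=442.0))  # same pitch class under A = 442 -> 3
</pre>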
<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but perhaps just a few pieces. Since different systems may need different amounts of data for training, the participants need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of the data are not guaranteed to be suitable for their training purposes. <br />
<br />
[ Perfe 02/24/05: I think that training data are a must. Training data should be a subset of the whole test set originally gathered. If train and test come from different populations then the estimations that we may get with the test will not be reliable; the goal of the train set is that of providing a reliable estimation of the expected performance with the test data].<br />
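<br />
If the organizers follow this suggestion, the training subset could be drawn as a stratified random sample of the full collection so that train and test share the same population; a minimal sketch (the 20% fraction, the 'genre' field, and the use of scikit-learn are assumptions for illustration only).<br />
<pre>
# Sketch: draw training excerpts as a genre-stratified random subset of the collection.
# The 20% fraction and the 'genre' metadata field are illustrative assumptions.
from sklearn.model_selection import train_test_split

def split_collection(entries, train_fraction=0.2, seed=0):
    """entries: list of dicts with at least 'path' and 'genre' keys."""
    genres = [e["genre"] for e in entries]
    train, test = train_test_split(entries, train_size=train_fraction,
                                   stratify=genres, random_state=seed)
    return train, test
</pre>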
<br />
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (i.e. 15, 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart); a small numeric check of this appears after the list below. The above paper describes three models of tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys, consistent with those used in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
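<br />
To make the closeness relations above concrete, here is a tiny check of how many pitches the related keys share (the scale spellings are the usual diatonic sets; the code is purely illustrative).<br />
<pre>
# Sketch: count shared pitch classes between closely related keys.
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]    # major scale, in semitones above the tonic
MINOR_STEPS = [0, 2, 3, 5, 7, 8, 10]    # natural minor scale

def scale(tonic_pc, steps):
    return {(tonic_pc + s) % 12 for s in steps}

c_major = scale(0, MAJOR_STEPS)
g_major = scale(7, MAJOR_STEPS)          # a perfect fifth above C
c_minor = scale(0, MINOR_STEPS)          # parallel minor of C major

print(len(c_major & g_major))            # -> 6 of 7 pitches shared (only F vs F# differ)
print(len(c_major & c_minor))            # -> 4 of 7 pitches shared
</pre>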
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior ?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed. <br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie-in between symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
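<br />
Downie's retrieval framing could be scored with standard IR metrics; as a rough sketch, treating "find all pieces in key X" as a query over the predicted labels gives a precision and recall per key (the function and field names are hypothetical, not an agreed metric).<br />
<pre>
# Sketch: precision/recall for the query "find all pieces in key X" over predicted key labels.
def precision_recall(predicted, truth, query_key):
    """predicted/truth: dicts mapping piece id -> key label, e.g. 'C major'."""
    retrieved = {pid for pid, key in predicted.items() if key == query_key}
    relevant  = {pid for pid, key in truth.items() if key == query_key}
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall    = len(hits) / len(relevant)  if relevant  else 0.0
    return precision, recall
</pre>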
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involve completely separate data and separate participants. From the committee point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
[ Perfe 02/24/05: See my comment above. Rendering midi into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not be totally extrapolable to audio-based music]<br />
<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.<br />
<br />
<br />
==Emilia's Comments==<br />
<br />
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some of the reviewers on several issues, which I would like to comment on:<br />
<br />
1.- I think it is important to provide some training data so that participants can evaluate their algorithms on material matching the evaluation conditions: genres, audio format, etc. This can also be useful to check that the algorithm works within the evaluation environment. If participants provide the output of their algorithm on this training data, it can serve as a way to verify that the algorithm performs the same way in the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids problems when running algorithms on different systems/platforms, languages, ...<br />
<br />
2.- It is important to establish some kind of rules for submission: binaries, matlab code, java???. Is it possible to submit different versions of the algorithm for the same participant? <br />
<br />
[Hendrik 02.26.05]: matlab would be very convenient. <br />
<br />
3.- I think that the use of Audio from synthesized MIDI would be a simplistic solution, not representative of the complexity of the problem. Maybe we could try to find MIDI + real performances, or have some of the evaluation material synthesized from MIDI but not all of it. Also, I agree with reviewer 2 that tuning errors should be considered as close tonalities.<br />
<br />
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.<br />
<br />
5.- I would propose to contact Marc Leman and his group, they have done a lot of work on perception based music analysis and they may be interested in participating: Marc.Leman@UGent.be. They have also a lot of experience in manual annotation. <br />
<br />
Best regards and thanks,</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_and_Symbolic_Key&diff=7952005:Audio and Symbolic Key2005-02-28T20:57:28Z<p>138.37.33.58: /* Evaluation Procedures */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this first step in the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representations. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interference and harmonic effects caused by the use of multiple instruments.<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es) [high]<br />
* Steffen Pauws (steffen.pauws@philips.com) [high]<br />
* Ozgur Izmirli (oizm@conncoll.edu) [moderate]<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg) [low]<br />
<br />
'''Symbolic Key-Finding''':<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi) [high]<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* David Temperley (dtemp@theory.esm.rochester.edu) and Daniel Sleator (sleator@cs.cmu.edu) [high]<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi) [high]<br />
* Craig Sapp (craig@ccrma.stanford.edu) [moderate]<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms.<br />
<br />
'''Test Set''': The test set we propose to use will consist of 30-second excerpts of pieces for which the keys are known. For example, symphonies and concertos by well-known composers often have the key stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the part of the piece where the stated global key can most reliably be assumed to hold.<br />
<br />
'''Input/Output''': The input to the system should be some musical excerpt (either audio or MIDI) and the output should be a key name, for example, C major or E flat minor.<br />
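<br />
Since different submissions may spell key names differently ("E flat minor", "Eb minor", ...), the evaluation script may want to normalize the output strings; below is a minimal sketch of such a normalizer (the accepted spellings are an assumption for illustration, not a format fixed by this proposal).<br />
<pre>
# Hypothetical normalizer for key-name strings such as "C major" or "E flat minor".
# The accepted spellings are assumptions for illustration, not part of the proposal.
PITCH_CLASSES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def parse_key(name):
    """Return (pitch class 0-11, 'major' or 'minor') for a key-name string."""
    tokens = name.strip().lower().split()
    mode = tokens[-1]                       # "major" or "minor"
    pc = PITCH_CLASSES[tokens[0][0].upper()]
    accidental = " ".join(tokens[:-1])[1:]  # e.g. " flat", "b", "#", or ""
    if "flat" in accidental or accidental.strip().startswith("b"):
        pc -= 1
    elif "sharp" in accidental or "#" in accidental:
        pc += 1
    return pc % 12, mode

assert parse_key("E flat minor") == (3, "minor")
assert parse_key("C major") == (0, "major")
</pre>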
<br />
'''System Calibration''': It is reasonable to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for these parameters. The participants will be provided with training data that they may use in determining the optimal settings. The training data will be a randomly selected, representative subset of the actual test data that will be used in the evaluation process.<br />
<br />
'''Submission ''': The participants will provide the evaluation committee a copy of their system as well as the output from their algorithm for the training data. This will serve as a way to test how the algorithm performs in the evaluation environment.<br />
<br />
'''Evaluation ''': The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. The key of the piece is the one defined by the composer in the title of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. A correct key assignment will be given a full point, and incorrect assignments will be allocated fractions of a point according to the following table:<br />
<br />
{|<br />
|Relation to correct key ||Points<br />
|-<br />
|Same||1<br />
|-<br />
|Perfect fifth||0.5<br />
|-<br />
|Relative major/minor||0.3<br />
|-<br />
|Parallel major/minor||0.2<br />
|}<br />
<br />
The team with the highest total score will be the designated winner. Further details of how each system performed within each genre will also be provided.<br />
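<br />
As a rough illustration of how the table could be applied, the following sketch scores one estimate against the ground truth, with keys represented as (pitch class, mode) pairs such as a normalizer like the one sketched under Input/Output above might produce (the function and its handling of the relations are an assumption, not the official scoring script).<br />
<pre>
# Sketch of the proposed scoring scheme: 1 / 0.5 / 0.3 / 0.2 points depending on the
# relation between the estimated key and the ground-truth key, both as (pitch_class, mode).
def score(estimated, truth):
    (est_pc, est_mode), (true_pc, true_mode) = estimated, truth
    if estimated == truth:
        return 1.0
    # Perfect fifth: same mode, tonics seven semitones apart in either direction.
    if est_mode == true_mode and (est_pc - true_pc) % 12 in (5, 7):
        return 0.5
    if est_mode != true_mode:
        # Relative major/minor: the minor tonic lies three semitones below its relative major.
        major_pc, minor_pc = (est_pc, true_pc) if est_mode == "major" else (true_pc, est_pc)
        if (major_pc - minor_pc) % 12 == 3:
            return 0.3
        # Parallel major/minor: same tonic, different mode.
        if est_pc == true_pc:
            return 0.2
    return 0.0

assert score((0, "major"), (9, "minor")) == 0.3   # C major vs A minor (relative)
assert score((0, "major"), (7, "major")) == 0.5   # C major vs G major (perfect fifth)
</pre>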
<br />
'''Comments''': Many excellent suggestions were made in the review process. Some of the ideas included: using actual audio files from recordings for the audio portion of the contest, employing other metrics used in information retrieval literature, using test data from a wider variety of genres, and considering the detection of key modulations. <br />
<br />
As this is a first attempt at evaluating key-finding across different systems employing a variety of algorithm combinations, we have opted to keep the evaluation procedure as simple and streamlined as possible. The results of this contest will lay the groundwork from which we can expand the techniques for key-finding evaluation.<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Audio Data''': Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that excerpts of only fifteen to thirty seconds may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of an audio file may be copied without violating them. This is advantageous since fifteen- to thirty-second excerpts will be well within this limit.<br />
<br />
[ Perfe 02/24/05: I guess that having longer excerpts would be interesting too. If not, it could be the case that in those 30 seconds we would get a lot of modulations that blur the results. On the other hand, I would like to encourage the use of Magnatune and other Creative Commons licensed sites, and also the Epitonic contents, to overcome any "legal" issue.]<br />
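<br />
Whatever excerpt length is finally chosen, trimming the evaluation files can be done with standard audio tools; a small sketch that keeps the first 30 seconds of each track (the 30-second figure, the directory layout, and the library choices are illustrative assumptions).<br />
<pre>
# Sketch: cut 30-second mono excerpts from the start of each audio file.
# Paths, the 30 s duration, and the 44.1 kHz rate are assumptions for illustration;
# the "excerpts/" output directory is assumed to exist.
import glob
import librosa
import soundfile as sf

for path in glob.glob("collection/*.wav"):
    audio, sr = librosa.load(path, sr=44100, mono=True, offset=0.0, duration=30.0)
    sf.write(path.replace("collection/", "excerpts/"), audio, sr)
</pre>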
<br />
<br />
'''MIDI Collections''': MIDI data are a symbolic representation of music, providing a numeric representation of the pitch, onset/offset time, and velocity of every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files, by more than two thousand composers, in MIDI format. All the files are presented with full title and composer, and most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to the various arrangements of instruments. <br />
<br />
[ Perfe 02/24/05: If we use some of those MIDI files to render audio versions (I would only use up to 25% MIDI-rendered files for the audio part of this contest), my suggestion is to "produce" the tracks a bit (i.e. adding reverb, doubling lines, etc.), especially if they are pop music. "Simple" MIDI files, as pale renditions of a piece, will yield pale estimations of the capabilities of our algorithms]<br />
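<br />
For the MIDI collection described above, the labeled key can often be read directly from a file's key-signature meta events; a minimal sketch using the mido library (the library choice, the file name, and the assumption that the first key signature is the global key are mine, not the proposal's).<br />
<pre>
# Sketch: read the first key_signature meta event from a MIDI file with mido.
# Assumes the file carries a key signature and that the first one is the global key.
import mido

def midi_key(path):
    for msg in mido.MidiFile(path):          # iterating a MidiFile yields messages in time order
        if msg.is_meta and msg.type == "key_signature":
            return msg.key                   # e.g. "Eb" for E-flat major, "Am" for A minor
    return None

print(midi_key("invention_01.mid"))          # hypothetical file name
</pre>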
<br />
'''Score-based Collections''': Score-based data are also a symbolic representation of music. In addition to numeric event information, they provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between the two by building a test collection that includes Audio Data and its MIDI representation, or a MIDI representation and the Audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
[Hendrik 02.26.05]: Key finding only makes sense for music with major/minor tonality. Some music is very clear in its tonal reference, e.g. Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and Modern Jazz. Other music has tonal centers but no major/minor tonality, e.g. Raga or Gamelan.<br />
So it could be useful to specify the realm of the challenge (the composers, epochs, or genres), e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 Hits 1950-2005, and New Orleans to Bebop.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". <br />
<br />
[Chinghua 02.10.05]: Those relationships can be considered keys close to the main key, but they are still not the main key. However, if the algorithm gives one of those answers, it should still earn some points. So I suggest that we give multiple levels of scores to the different answers. For example, the main key gets full points (say 5), the perfect fifth gets 75% or 80% of that (say 3), and so on. <br />
<br />
<br />
What about tuning errors? In the case of audio, different tuning systems can be used. The detection algorithm should be able to estimate where the key is "tuned" (A = 440 or 442 Hz, ...). Keys should also be considered 'close' if they are one semitone apart, to account for the difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
[Chinghua 02.10.05]: Since we will use MIDI synthesizer to generate the audio, the tuning won't be a serious problem. The detection algorithm should have the ability to regard both 440 and 442 Hz as pitch A. If the original piece is written in A Major but the arrangement of MIDI shifted a half step down to Ab Major, then the algorithm (both MIDI and Audio part) should detect it as Ab Major instead of A Major. <br />
<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but it may be just a few pieces. Since different systems may need different amounts of data for training, participants will need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of that data are not guaranteed to suit their training purposes. <br />
<br />
[ Perfe 02/24/05: I think that training data are a must. Training data should be a subset of the whole test set originally gathered. If train and test come from different populations then the estimations that we may get with the test will not be reliable; the goal of the train set is that of providing a reliable estimation of the expected performance with the test data].<br />
<br />
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (i.e. 15, 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior ?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed. <br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie-in between symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involve completely separate data and separate participants. From the committee point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
[ Perfe 02/24/05: See my comment above. Rendering midi into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not be totally extrapolable to audio-based music]<br />
<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.<br />
<br />
<br />
==Emilia's Comments==<br />
<br />
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some of the reviewers on several issues, which I would like to comment on:<br />
<br />
1.- I think it is important to provide some training data so that participants can evaluate their algorithms on material matching the evaluation conditions: genres, audio format, etc. This can also be useful to check that the algorithm works within the evaluation environment. If participants provide the output of their algorithm on this training data, it can serve as a way to verify that the algorithm performs the same way in the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids problems when running algorithms on different systems/platforms, languages, ...<br />
<br />
2.- It is important to establish some kind of rules for submission: binaries, matlab code, java???. Is it possible to submit different versions of the algorithm for the same participant? <br />
<br />
[Hendrik 02.26.05]: matlab would be very convenient. <br />
<br />
3.- I think that the use of Audio from synthesized MIDI would be a simplistic solution, not representative of the complexity of the problem. Maybe we could try to find MIDI + real performances, or have some of the evaluation material synthesized from MIDI but not all of it. Also, I agree with reviewer 2 that tuning errors should be considered as close tonalities.<br />
<br />
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.<br />
<br />
5.- I would propose to contact Marc Leman and his group, they have done a lot of work on perception based music analysis and they may be interested in participating: Marc.Leman@UGent.be. They have also a lot of experience in manual annotation. <br />
<br />
Best regards and thanks,</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_and_Symbolic_Key&diff=7932005:Audio and Symbolic Key2005-02-28T20:50:55Z<p>138.37.33.58: /* Evaluation Procedures */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this first step in the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es) [high]<br />
* Steffen Pauws (steffen.pauws@philips.com) [high]<br />
* Ozgur Izmirli (oizm@conncoll.edu) [moderate]<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg) [low]<br />
<br />
'''Symbolic Key-Finding''':<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi) [high]<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* David Temperley (dtemp@theory.esm.rochester.edu) and Daniel Sleator (sleator@cs.cmu.edu) [high]<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi) [high]<br />
* Craig Sapp (craig@ccrma.stanford.edu) [moderate]<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms.<br />
<br />
Test Set<br />
The test set we propose to use will consist of 30-second excerpts of pieces for which the keys are known. For example, symphonies and concertos by well-known composers often have the key stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the part of the piece where the stated global key can most reliably be assumed to hold.<br />
<br />
Input/Output<br />
The input to the system should be some musical excerpt (either audio or MIDI) and the output should be a key name, for example, C major or E flat minor.<br />
<br />
System Calibration<br />
It is reasonable to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for these parameters. The participants will be provided with training data that they may use in determining the optimal settings. The training data will be a randomly selected, representative subset of the actual test data that will be used in the evaluation process.<br />
<br />
Submission <br />
The participants will provide the evaluation committee a copy of their system as well as the output from their algorithm for the training data. This will serve as a way to test how the algorithm performs in the evaluation environment.<br />
<br />
Evaluation <br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. The key of the piece is the one defined by the composer in the title of the piece.<br />
<br />
We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor.<br />
<br />
A correct key assignment will be given a full point, and incorrect assignments will be allocated fractions of a point according to the following table:<br />
<br />
{|<br />
|Relation to correct key ||Points<br />
|-<br />
|Same||1<br />
|-<br />
|Perfect fifth||0.5<br />
|-<br />
|Relative major/minor||0.3<br />
|-<br />
|Parallel major/minor||0.2<br />
|}<br />
<br />
The team with the highest total score will be the designated winner. Further details of how each system performed within each genre will also be provided.<br />
<br />
Comments<br />
Many excellent suggestions were made in the review process. Some of the ideas included: using actual audio files from recordings for the audio portion of the contest, <br />
employing other metrics used in information retrieval literature, using test data from a wider variety of genres, and considering the detection of key modulations. <br />
<br />
As this is a first attempt at evaluating key-finding across different systems employing a variety of algorithm combinations, we have opted to keep the evaluation procedure as simple and streamlined as possible. The results of this contest will lay the groundwork from which we can expand the techniques for key-finding evaluation.<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Audio Data''': Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that excerpts of only fifteen to thirty seconds may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of an audio file may be copied without violating them. This is advantageous since fifteen- to thirty-second excerpts will be well within this limit.<br />
<br />
[ Perfe 02/24/05: I guess that having longer excerpts would be interesting too. If not, it could be the case that in those 30 seconds we would get a lot of modulations that blur the results. On the other hand, I would like to encourage the use of Magnatune and other Creative Commons licensed sites, and also the Epitonic contents, to overcome any "legal" issue.]<br />
<br />
<br />
'''MIDI Collections''': MIDI data are a symbolic representation of music, providing a numeric representation of the pitch, onset/offset time, and velocity of every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files, by more than two thousand composers, in MIDI format. All the files are presented with full title and composer, and most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to the various arrangements of instruments. <br />
<br />
[ Perfe 02/24/05: If we use some of those MIDI files to render audio versions (I would only use up to 25% MIDI-rendered files for the audio part of this contest), my suggestion is to "produce" the tracks a bit (i.e. adding reverb, doubling lines, etc.), especially if they are pop music. "Simple" MIDI files, as pale renditions of a piece, will yield pale estimations of the capabilities of our algorithms]<br />
<br />
'''Score-based Collections''': Score-based data are also a symbolic representation of music. In addition to numeric event information, they provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between the two by building a test collection that includes Audio Data and its MIDI representation, or a MIDI representation and the Audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
[Hendrik 02.26.05]: Key finding only makes sense for music with major/minor tonality. Some music is very clear in its tonal reference, e.g. Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and Modern Jazz. Other music has tonal centers but no major/minor tonality, e.g. Raga or Gamelan.<br />
So it could be useful to specify the realm of the challenge (the composers, epochs, or genres), e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 Hits 1950-2005, and New Orleans to Bebop.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". <br />
<br />
[Chinghua 02.10.05]: Those relationships can be considered keys close to the main key, but they are still not the main key. However, if the algorithm gives one of those answers, it should still earn some points. So I suggest that we give multiple levels of scores to the different answers. For example, the main key gets full points (say 5), the perfect fifth gets 75% or 80% of that (say 3), and so on. <br />
<br />
<br />
What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate to what reference the key is "tuned" (A = 440 or 442 Hz, ...). Keys should also be considered 'close' if they have a relationship of "1 semitone", to account for this difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
[Chinghua 02.10.05]: Since we will use a MIDI synthesizer to generate the audio, the tuning won't be a serious problem. The detection algorithm should have the ability to regard both 440 and 442 Hz as pitch A. If the original piece is written in A Major but the MIDI arrangement is shifted a half step down to Ab Major, then the algorithm (both the MIDI and Audio parts) should detect it as Ab Major instead of A Major. <br />
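<br />
To illustrate how an audio algorithm might cope with a non-440 Hz reference before naming the key, here is a small sketch that estimates the tuning offset from detected fundamental frequencies and folds them onto pitch classes. The function names, the use of the median, and the numpy dependency are assumptions for illustration only.<br />
<pre>
import numpy as np

def tuning_offset_cents(f0s_hz):
    """Median deviation (in cents) of detected f0s from an A=440 Hz semitone grid."""
    semis = 12 * np.log2(np.asarray(f0s_hz, dtype=float) / 440.0)  # semitones from A4
    deviation = semis - np.round(semis)                            # fractional part
    return 100.0 * float(np.median(deviation))

def pitch_classes(f0s_hz, offset_cents=0.0):
    """Map frequencies to pitch classes (0 = C, ..., 9 = A, 11 = B) after removing the offset."""
    semis = 12 * np.log2(np.asarray(f0s_hz, dtype=float) / 440.0) - offset_cents / 100.0
    return (np.round(semis).astype(int) + 9) % 12

# An orchestra tuned to A=442 Hz yields an offset of roughly +8 cents, which can
# be removed before mapping the detected pitches to a key.
</pre>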
<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but maybe just a few pieces. Since different systems may need different amounts of data for training, the participants need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of the data will not be guaranteed to be good for their training purposes. <br />
<br />
[ Perfe 02/24/05: I think that training data are a must. Training data should be a subset of the whole test set originally gathered. If train and test come from different populations then the estimations that we may get with the test will not be reliable; the goal of the train set is that of providing a reliable estimation of the expected performance with the test data].<br />
<br />
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (i.e. 15, 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior ?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed. <br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involve completely separate data and separate participants. From the committee point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
[ Perfe 02/24/05: See my comment above. Rendering midi into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not be totally extrapolable to audio-based music]<br />
<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.<br />
<br />
<br />
==Emilia's Comments==<br />
<br />
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some reviewers on several issues that I would like to comment on:<br />
<br />
1.- I think it is important to provide some training data so that participants can evaluate their algorithms according to the evaluation material: genres, audio format, etc. I think this can be useful also to test that the algorithm is working within the evaluation environment. If participants provide the output of their algorithm to this training data, it can serve as a way to test that the algorithm is performing well in the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids some problems when running algorithms in different systems/platforms, languages,...<br />
<br />
2.- It is important to establish some kind of rules for submission: binaries, matlab code, java???. Is it possible to submit different versions of the algorithm for the same participant? <br />
<br />
[Hendrik 02.26.05]: matlab would be very convenient. <br />
<br />
3.- I think that the use of Audio from synthesized MIDI would be a simplistic solution that is not representative of the complexity of the problem. Maybe we could try to find MIDI + real performances, or to have some MIDI synthesized but not all of the evaluation material. Then, I agree with reviewer 2 that tuning errors should be considered as close tonalities.<br />
<br />
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.<br />
<br />
5.- I would propose to contact Marc Leman and his group, they have done a lot of work on perception based music analysis and they may be interested in participating: Marc.Leman@UGent.be. They have also a lot of experience in manual annotation. <br />
<br />
Best regards and thanks,</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_and_Symbolic_Key&diff=7922005:Audio and Symbolic Key2005-02-28T20:49:24Z<p>138.37.33.58: /* Potential Participants */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this first step in the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es) [high]<br />
* Steffen Pauws (steffen.pauws@philips.com) [high]<br />
* Ozgur Izmirli (oizm@conncoll.edu) [moderate]<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg) [low]<br />
<br />
'''Symbolic Key-Finding''':<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi) [high]<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu) [high]<br />
* David Temperley (dtemp@theory.esm.rochester.edu) and Daniel Sleator (sleator@cs.cmu.edu) [high]<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi) [high]<br />
* Craig Sapp (craig@ccrma.stanford.edu) [moderate]<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. Once these settings are determined, an accuracy rate may be calculated. The input of the test should be some excerpt of the pieces in the test set, and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the only part of the piece for which establishment of the global, known key can be guaranteed. <br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that an algorithm that returns a key closely related to the actual key is superior to one that returns an unrelated key. We may then use this information to generate further metrics. <br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings. <br />
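<br />
As an illustration of the error analysis described above, the sketch below classifies the relationship between an estimated key and the reference key and tallies an accuracy rate over a test set. Keys are simplified to (pitch class, mode) pairs and enharmonic spelling is ignored; this is an assumption made for illustration, not the agreed evaluation code.<br />
<pre>
def key_relationship(pred, truth):
    """pred/truth are (pitch_class 0-11, "major"/"minor") pairs; 0 = C, ..., 9 = A."""
    (pp, pm), (tp, tm) = pred, truth
    if pred == truth:
        return "same"
    if pm == tm and (pp - tp) % 12 in (5, 7):
        return "fifth"                      # distance of a perfect fifth
    if pm != tm and pp == tp:
        return "parallel"                   # parallel major/minor
    if pm != tm and ((tm == "major" and pp == (tp + 9) % 12) or
                     (tm == "minor" and pp == (tp + 3) % 12)):
        return "relative"                   # relative major/minor
    return "other"

def evaluate(predicted_keys, true_keys):
    """Accuracy rate plus a breakdown of how 'close' the wrong answers were."""
    counts = {"same": 0, "fifth": 0, "relative": 0, "parallel": 0, "other": 0}
    for pred, truth in zip(predicted_keys, true_keys):
        counts[key_relationship(pred, truth)] += 1
    accuracy = counts["same"] / max(len(true_keys), 1)
    return accuracy, counts

# Example: an estimate of G major against a true key of C major is a "fifth":
# key_relationship((7, "major"), (0, "major")) == "fifth"
</pre>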
<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Audio Data''': Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that only fifteen to thirty second excerpts may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of audio files may be copied without any violations of such regulations. This is advantageous since fifteen to thirty second excerpts will be well within this limit.<br />
<br />
[ Perfe 02/24/05: I guess that having longer excerpts would be interesting too. If not, it could be the case that in those 30 seconds we would get a lot of modulations that blur the results. On the other hand, I would like to encourage the use of Magnatune and other Creative Commons license-based sites, and also the Epitonic content, to overcome any "legal" issue.]<br />
<br />
<br />
'''MIDI Collections''': MIDI data are a symbolic representation of music. They provide a numeric representation of the pitch, onset/offset time and velocity for every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files by more than two thousand composers in MIDI format. All the files are presented with full title and composer. Also, most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to the various arrangements of instruments. <br />
<br />
[ Perfe 02/24/05: If we use some of those midi files to render audio versions (I would only use up to 25% of midi-rendered files for the audio part of this contest), my suggestion is to "produce" the tracks a bit (e.g. adding reverb, doubling lines, etc.), especially if they are from pop music. "Simple" midi files, as pale renditions of a piece, will yield pale estimations of the capabilities of our algorithms]<br />
<br />
'''Score-based Collections''': Score-based data are also symbolic representations of music. In addition to numeric event information, they also provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
[Hendrik 02.26.05]: Key finding only makes sense for music of major/minor tonality. Some music is very clear in its tonal reference, e.g., Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and Modern Jazz. Other music has tonal centers but no major/minor tonality, e.g. Raga or Gamelan.<br />
So it could be useful to specify the realm of the challenge, the composers, epochs, or genres, e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 Hits 1950-2005, and New Orleans to Bebop.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". <br />
<br />
[Chinghua 02.10.05]: Those relationships can be considered keys close to the main key, but they are still not the main key. If the algorithm gives those answers, it should still earn some points. So I suggest that we may give multiple levels of scores to the different answers; for example, the main key gets full points (say, 5), the perfect fifth gets 75% or 80% of the full points, and so on. <br />
<br />
<br />
What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate to what reference the key is "tuned" (A = 440 or 442 Hz, ...). Keys should also be considered 'close' if they have a relationship of "1 semitone", to account for this difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
[Chinghua 02.10.05]: Since we will use a MIDI synthesizer to generate the audio, the tuning won't be a serious problem. The detection algorithm should have the ability to regard both 440 and 442 Hz as pitch A. If the original piece is written in A Major but the MIDI arrangement is shifted a half step down to Ab Major, then the algorithm (both the MIDI and Audio parts) should detect it as Ab Major instead of A Major. <br />
<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but maybe just a few pieces. Since different systems may need different amounts of data for training, the participants need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of the data will not be guaranteed to be good for their training purposes. <br />
<br />
[ Perfe 02/24/05: I think that training data are a must. Training data should be a subset of the whole test set originally gathered. If train and test come from different populations then the estimations that we may get with the test will not be reliable; the goal of the train set is that of providing a reliable estimation of the expected performance with the test data].<br />
<br />
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (i.e. 15, 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior ?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed. <br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involve completely separate data and separate participants. From the committee point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
[ Perfe 02/24/05: See my comment above. Rendering midi into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not be totally extrapolable to audio-based music]<br />
<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.<br />
<br />
<br />
==Emilia's Comments==<br />
<br />
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some reviewers on several issues that I would like to comment on:<br />
<br />
1.- I think it is important to provide some training data so that participants can evaluate their algorithms according to the evaluation material: genres, audio format, etc. I think this can be useful also to test that the algorithm is working within the evaluation environment. If participants provide the output of their algorithm to this training data, it can serve as a way to test that the algorithm is performing well in the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids some problems when running algorithms in different systems/platforms, languages,...<br />
<br />
2.- It is important to establish some kind of rules for submission: binaries, matlab code, java???. Is it possible to submit different versions of the algorithm for the same participant? <br />
<br />
[Hendrik 02.26.05]: matlab would be very convenient. <br />
<br />
3.- I think that the use of Audio from synthesized MIDI would be a simplistic solution that is not representative of the complexity of the problem. Maybe we could try to find MIDI + real performances, or to have some MIDI synthesized but not all of the evaluation material. Then, I agree with reviewer 2 that tuning errors should be considered as close tonalities.<br />
<br />
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.<br />
<br />
5.- I would propose to contact Marc Leman and his group, they have done a lot of work on perception based music analysis and they may be interested in participating: Marc.Leman@UGent.be. They have also a lot of experience in manual annotation. <br />
<br />
Best regards and thanks,</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Melody_Extr&diff=1412005:Audio Melody Extr2005-02-12T18:01:02Z<p>138.37.33.58: /* Potential Participants */</p>
<hr />
<div>==Proposer==<br />
<br />
Graham Poliner (Columbia University) graham@ee.columbia.edu<br />
<br />
<br />
==Title==<br />
<br />
Melody Extraction of Polyphonic Audio<br />
<br />
<br />
==Description==<br />
<br />
The melodic content of polyphonic audio provides an intuitive representation for summarization and retrieval. Numerous potential approaches exist for automated melody extraction; therefore, the MIREX 2005 Melody Extraction Evaluation seeks to compare the accuracy of state-of-the-art melody transcription algorithms. The evaluation data set will consist of an eclectic collection of audio excerpts along with the corresponding frame-based transcription of the dominant voice. The performance of the submitted algorithms will be evaluated based on the percentage of frames correctly transcribed. <br />
<br />
<br />
==Potential Participants==<br />
<br />
*Juan P. Bello - juan.bello-correa@elec.qmul.ac.uk - Very Likely<br />
*Ali Taylan Cemgil - cemgil@science.uva.nl - Moderately Likely<br />
*Emilia Gomez - emilia.gomez@iua.upf.es - Likely<br />
*Masataka Goto - m.goto@aist.go.jp - Moderately Likely<br />
*Jana Eggink - j.eggink@dcs.shef.ac.uk - Moderately Likely<br />
*Anssi Klapuri - klap@cs.tut.fi - Moderately Likely<br />
*Matija Marolt - matija.marolt@fri.uni-lj.si - Moderately Likely<br />
*Rui Pedro Paiva - ruipedro@dei.uc.pt - Very Likely<br />
*Graham Poliner - graham@ee.columbia.edu - Very Likely<br />
*Sven Tappert - s_tappert@yahoo.de - Very Likely<br />
*Karin Dressler - dresslkn@idmt.fraunhofer.de - Likely<br />
*Matti Ryynänen - matti.ryynanen@tut.fi - Moderately Likely<br />
*Emmanuel Vincent - emmanuel.vincent@elec.qmul.ac.uk - Likely<br />
<br />
==Evaluation Procedures==<br />
<br />
Following the evaluation procedure specified for the ISMIR 2004 Melody Contest<br />
*Option 1 - A frame-based comparison between the predicted and reference melody<br />
The total prediction accuracy may be computed by calculating the average absolute difference for each frame, where a maximal error is defined as one semitone (100 cents) and a value of 0 Hz may be assigned to unvoiced segments (a rough sketch of this computation appears after this list). <br />
*Option 2 - A frame-based comparison between the predicted and reference melody over a one-octave range<br />
This option is the same as Option 1; however, the predicted melody and reference melody are mapped into the range of one octave before calculating the absolute difference.<br />
*Option 3 - Edit distance between the estimated melody and the correct melody<br />
Following the edit distance calculation outlined in Grachten et al. 2002<br />
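<br />
The sketch below is one possible reading of Options 1 and 2: a frame-based error in cents, capped at one semitone, with 0 Hz marking unvoiced frames, and an optional folding into a single octave. The array layout, the chroma folding, and the conversion of mean error into an accuracy figure are assumptions made for illustration, not the agreed metric.<br />
<pre>
import numpy as np

def frame_error_cents(pred_hz, ref_hz, fold_to_octave=False):
    """Per-frame error in cents; 0 Hz means unvoiced, errors are capped at 100 cents."""
    pred = np.asarray(pred_hz, dtype=float)
    ref = np.asarray(ref_hz, dtype=float)
    err = np.full(ref.shape, 100.0)                  # maximal error by default
    err[(pred == 0) & (ref == 0)] = 0.0              # both unvoiced: no error
    voiced = (pred > 0) & (ref > 0)
    cents = 1200.0 * np.abs(np.log2(pred[voiced] / ref[voiced]))
    if fold_to_octave:                               # Option 2: ignore octave errors
        cents = cents % 1200.0
        cents = np.minimum(cents, 1200.0 - cents)
    err[voiced] = np.minimum(cents, 100.0)           # cap at one semitone
    return err

def accuracy(pred_hz, ref_hz, fold_to_octave=False):
    """Mean per-frame error rescaled so that 1.0 means a perfect transcription."""
    return 1.0 - frame_error_cents(pred_hz, ref_hz, fold_to_octave).mean() / 100.0
</pre>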
<br />
<br />
==Relevant Test Collections==<br />
<br />
For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions. Due to the success of the ISMIR 2004 Melody Competition, we recommend that the evaluation set be reused and augmented with additional audio excerpts from such genres as pop, jazz, digital, and opera. The new ground truth may be created by manually correcting the output of current melody transcription algorithms. We may also wish to consider representing the genres in different proportions for the MIREX 2005 evaluation. <br />
The inclusion of popular music may result in additional copyright issues. Copyright law prohibits the universal or unlimited distribution of material on the web. However, if access to the media is limited to MIREX participants, this should be considered a fair use of the copyrighted materials.<br />
<br />
<br />
==Review 1==<br />
<br />
Problem is reasonably well defined and would be considered interesting in terms of current research.<br />
<br />
No mention of audio format/sampling rate, will assume:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* mono<br />
* 30 seconds excerpts<br />
* files are named as "001.wav" to "999.wav"<br />
No mention of frame size or hop size; will this be the same as the 2004 competition (frame size 2048, hop size 256)? Is this optimal? Would some participants prefer to use different sizes? Could the proposed evaluation metrics be modified to use absolute time indexes and a tolerance, and therefore be independent of framing?<br />
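<br />
One way to make the comparison independent of framing, as asked above, is to resample the reference melody onto the submission's own time stamps before scoring. The helper below is a hypothetical sketch (nearest reference frame at or after each predicted time); a real evaluation would need to treat voicing boundaries and tolerances more carefully.<br />
<pre>
import numpy as np

def resample_reference(ref_times, ref_hz, pred_times):
    """Look up the reference f0 for each predicted time stamp (seconds)."""
    ref_times = np.asarray(ref_times, dtype=float)
    ref_hz = np.asarray(ref_hz, dtype=float)
    idx = np.searchsorted(ref_times, np.asarray(pred_times, dtype=float))
    idx = np.clip(idx, 0, len(ref_times) - 1)        # clamp to the last reference frame
    return ref_hz[idx]
</pre>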
<br />
In the proposed evaluation metrics there is no mention of whether Options 1 and 2 will be averaged as they were last year, or how Option 3 will be combined with these.<br />
Statistical significance of differences between submissions should be estimated.<br />
<br />
Re-use and augmentation of last year's database is fine; however, there is no mention of where new data will come from. Obviously the Magnatune database would be a good source, as this can also be distributed; however, it may be best to distribute last year's database and hold back new examples. How big should the new database be? 50 files? I assume there are likely to be no trained submissions, or that they will be pre-trained, so a single pass over the data should be fine. There is also no mention of how many non-participating transcribers will produce the ground truth and how differences in transcriptions will be resolved. Given the IP status of the Magnatune database, distribution to transcribers should not be a problem.<br />
<br />
Given the high number of potential participants, I think we can be confident of sufficient participation to run the evaluation.<br />
<br />
Recommendation: Significant refinements to proposal and accept.<br />
<br />
<br />
==Review 2==<br />
<br />
This problem is well defined and very relevant to MIR.<br />
<br />
The mentioned possible participants are really working in the field. However, the participants marked as "very likely" are the same people who participated last year, while some key researchers in the field are modestly marked as "moderately likely". I believe that for this evaluation to be meaningful, the organizers should secure the participation of Masataka Goto (whose PreFest algorithm is still the main reference for melody extraction), Matija Marolt, Jana Eggink (both of whom published relevant work last year) and Anssi Klapuri (who has an extensive research record on relevant issues). Also, apart from Ali Taylan Cemgil, some of the people working in more Bayesian-based approaches to relevant problems are not mentioned: Chris Raphael (Indiana U), Samer Abdallah (Queen Mary, London), Randall Leistikow (Stanford U), Kunio Kashino (NTT Japan). It could be very interesting to have them on board.<br />
<br />
Regarding evaluation procedures, this contest has the advantage of having a precedent during last year's exercise. I would make a few suggestions from that experience:<br />
* UPF should make available any semi-automatic tool for evaluation used last year.<br />
* Each sound file to be used, should be cross-annotated, and the variability between annotations should be used for the evaluation.<br />
* Arrangements with 2 or more voices should be eliminated from the training/test set. In those, there is no clear definition of the melody to be extracted.<br />
* There should be a separate evaluation for melody segmentation: how well the algorithm separates those excerpts containing melodic parts from those that are purely background. The evaluation can be similar to the one in Marolt's paper for DAFx04.<br />
I would recommend the organizers to contact Emilia Gomez, Sebastian Strecht and Bee-Suan Ong from UPF, about last year's experience. We should learn from that experience and improve where necessary.<br />
<br />
Using the RWC database, Magnatunes and other similar collections, could help to expand the training and test sets. The organizers will need to coordinate a wide effort to expand on the currently existing contest database. Melody annotation is very complex and quite time-consuming, so only through a concerted effort will a proper test set be developed.<br />
The organizers could also contact Michele Lessaffre in Ghent, about their annotations efforts in the past (see ISMIR 2004).<br />
<br />
==Downie's Comments==<br />
<br />
1. The reviewers have summed up the issues very well. This is a hard task to evaluate completely and well. Can we come up with a "baby" version that we can do now while aiming toward a richer evaluation down the road?<br />
<br />
==Emmanuel's Comments==<br />
<br />
As a potential participant, I have two comments.<br />
<br />
* How can we measure the performance of an algorithm regarding fine identification of f0 if the target f0 is created with another algorithm? This is not a ground truth! I would rather use the following error for option 1: error is equal to 0 whenever the predicted f0 is within 1/4 tone of the reference f0, and error is equal to 1 otherwise. This also solves the frame size issue, since the reference f0 may vary slightly depending on the frame size but not the discrete pitch. Another possibility would be to consider prediction of discrete (MIDI) pitch, which is sufficient for MIR applications and relevant as long as all excerpts have the same reference pitch of 440 Hz (no ancient music then). Discrete events are needed anyway to compute the edit distance, aren't they? (please insert an http link to the article describing the calculation of this distance)<br />
<br />
* The distinction between voiced/unvoiced (melody/accompaniment) segments is not very clear: in my opinion when the main melody is silent for a while, you hear another melody inside accompaniment. Last year melody was defined using training data from the same musical excerpts as test data, but this is not a good idea since it may lead to learn data-specific melody characteristics. I would like to use excerpts containing only clearly voiced portions and/or to define melody by its pitch range ("if the dominant pitch is between A and B then it is part of the melody"), so that no training set is needed to define melody.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Melody_Extr&diff=1402005:Audio Melody Extr2005-02-12T17:59:44Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Graham Poliner (Columbia University) graham@ee.columbia.edu<br />
<br />
<br />
==Title==<br />
<br />
Melody Extraction of Polyphonic Audio<br />
<br />
<br />
==Description==<br />
<br />
The melodic content of polyphonic audio provides an intuitive representation for summarization and retrieval. Numerous potential approaches exist for automated melody extraction; therefore, the MIREX 2005 Melody Extraction Evaluation seeks to compare the accuracy of state-of-the-art melody transcription algorithms. The evaluation data set will consist of an eclectic collection of audio excerpts along with the corresponding frame-based transcription of the dominant voice. The performance of the submitted algorithms will be evaluated based on the percentage of frames correctly transcribed. <br />
<br />
<br />
==Potential Participants==<br />
<br />
*Juan P. Bello - juan.bello-correa@elec.qmul.ac.uk - Very Likely<br />
*Ali Taylan Cemgil - cemgil@science.uva.nl - Moderately Likely<br />
*Emilia Gomez - emilia.gomez@iua.upf.es - Likely<br />
*Masataka Goto - m.goto@aist.go.jp - Moderately Likely<br />
*Jana Eggink - j.eggink@dcs.shef.ac.uk - Moderately Likely<br />
*Anssi Klapuri - klap@cs.tut.fi - Moderately Likely<br />
*Matija Marolt - matija.marolt@fri.uni-lj.si - Moderately Likely<br />
*Rui Pedro Paiva - ruipedro@dei.uc.pt - Very Likely<br />
*Graham Poliner - graham@ee.columbia.edu - Very Likely<br />
*Sven Tappert - s_tappert@yahoo.de - Very Likely<br />
*Karin Dressler - dresslkn@idmt.fraunhofer.de - Likely<br />
*Matti Ryynänen - matti.ryynanen@tut.fi - Moderately Likely<br />
<br />
==Evaluation Procedures==<br />
<br />
Following the evaluation procedure specified for the ISMIR 2004 Melody Contest<br />
*Option 1 - A frame-based comparison between the predicted and reference melody<br />
The total prediction accuracy may be computed by calculating the average absolute difference for each frame where a maximal error is defined as one semitone = 100 cents and a value of 0 Hz may be assigned to unvoiced segments. <br />
*Option 2 - A frame-based comparison between the predicted and reference melody over a one-octave range<br />
This option is the same as Option 1; however, the predicted melody and reference melody are mapped into the range of one octave before calculating the absolute difference.<br />
*Option 3 - Edit distance between the estimated melody and the correct melody<br />
Following the edit distance calculation outlined in Grachten et al. 2002<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions. Due to the success of the ISMIR 2004 Melody Competition, we recommend that the evaluation set be reused and augmented with additional audio excerpts from such genres as pop, jazz, digital, and opera. The new ground truth may be created by manually correcting the output of current melody transcription algorithms. We may also wish to consider representing the genres in different proportions for the MIREX 2005 evaluation. <br />
The inclusion of popular music may result in additional copyright issues. Copyright law prohibits the universal or unlimited distribution of material on the web. However, if access to the media is limited to MIREX participants, this should be considered a fair use of the copyrighted materials.<br />
<br />
<br />
==Review 1==<br />
<br />
Problem is reasonably well defined and would be considered interesting in terms of current research.<br />
<br />
No mention of audio format/sampling rate, will assume:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* mono<br />
* 30 seconds excerpts<br />
* files are named as "001.wav" to "999.wav"<br />
No mention of frame size or hop size; will this be the same as the 2004 competition (frame size 2048, hop size 256)? Is this optimal? Would some participants prefer to use different sizes? Could the proposed evaluation metrics be modified to use absolute time indexes and a tolerance, and therefore be independent of framing?<br />
<br />
In the proposed evaluation metrics there is no mention of whether Options 1 and 2 will be averaged as they were last year, or how Option 3 will be combined with these.<br />
Statistical significance of differences between submissions should be estimated.<br />
<br />
Re-use and augmentation of last year's database is fine; however, there is no mention of where new data will come from. Obviously the Magnatune database would be a good source, as this can also be distributed; however, it may be best to distribute last year's database and hold back new examples. How big should the new database be? 50 files? I assume there are likely to be no trained submissions, or that they will be pre-trained, so a single pass over the data should be fine. There is also no mention of how many non-participating transcribers will produce the ground truth and how differences in transcriptions will be resolved. Given the IP status of the Magnatune database, distribution to transcribers should not be a problem.<br />
<br />
Given the high number of potential participants, I think we can be confident of sufficient participation to run the evaluation.<br />
<br />
Recommendation: Significant refinements to proposal and accept.<br />
<br />
<br />
==Review 2==<br />
<br />
This problem is well defined and very relevant to MIR.<br />
<br />
The mentioned possible participants are really working in the field. However, the participants marked as "very likely" are the same people who participated last year, while some key researchers in the field are modestly marked as "moderately likely". I believe that for this evaluation to be meaningful, the organizers should secure the participation of Masataka Goto (whose PreFest algorithm is still the main reference for melody extraction), Matija Marolt, Jana Eggink (both of whom published relevant work last year) and Anssi Klapuri (who has an extensive research record on relevant issues). Also, apart from Ali Taylan Cemgil, some of the people working in more Bayesian-based approaches to relevant problems are not mentioned: Chris Raphael (Indiana U), Samer Abdallah (Queen Mary, London), Randall Leistikow (Stanford U), Kunio Kashino (NTT Japan). It could be very interesting to have them on board.<br />
<br />
Regarding evaluation procedures, this contest has the advantage of having a precedent during last year's exercise. I would make a few suggestions from that experience:<br />
* UPF should make available any semi-automatic tool for evaluation used last year.<br />
* Each sound file to be used, should be cross-annotated, and the variability between annotations should be used for the evaluation.<br />
* Arrangements with 2 or more voices should be eliminated from the training/test set. In those, there is no clear definition of the melody to be extracted.<br />
* There should be a separate evaluation for melody segmentation: how well the algorithm separates those excerpts containing melodic parts from those that are purely background. The evaluation can be similar to the one in Marolt's paper for DAFx04.<br />
I would recommend the organizers to contact Emilia Gomez, Sebastian Strecht and Bee-Suan Ong from UPF, about last year's experience. We should learn from that experience and improve where necessary.<br />
<br />
Using the RWC database, Magnatunes and other similar collections, could help to expand the training and test sets. The organizers will need to coordinate a wide effort to expand on the currently existing contest database. Melody annotation is very complex and quite time-consuming, so only through a concerted effort will a proper test set be developed.<br />
The organizers could also contact Michele Lessaffre in Ghent, about their annotations efforts in the past (see ISMIR 2004).<br />
<br />
==Downie's Comments==<br />
<br />
1. The reviewers have summed up the issues very well. This is a hard task to evaluate completely and well. Can we come up with a "baby" version that we can do now while aiming toward a richer evaluation down the road?<br />
<br />
==Emmanuel's Comments==<br />
<br />
As a potential participant, I have two comments.<br />
<br />
* How can we measure the performance of an algorithm regarding fine identification of f0 if the target f0 is created with another algorithm? This is not a ground truth! I would rather use the following error for option 1: error is equal to 0 whenever the predicted f0 is within 1/4 tone of the reference f0, and error is equal to 1 otherwise. This also solves the frame size issue, since the reference f0 may vary slightly depending on the frame size but not the discrete pitch. Another possibility would be to consider prediction of discrete (MIDI) pitch, which is sufficient for MIR applications and relevant as long as all excerpts have the same reference pitch of 440 Hz (no ancient music then). Discrete events are needed anyway to compute the edit distance, aren't they? (please insert an http link to the article describing the calculation of this distance)<br />
<br />
* The distinction between voiced/unvoiced (melody/accompaniment) segments is not very clear: in my opinion when the main melody is silent for a while, you hear another melody inside accompaniment. Last year melody was defined using training data from the same musical excerpts as test data, but this is not a good idea since it may lead to learn data-specific melody characteristics. I would like to use excerpts containing only clearly voiced portions and/or to define melody by its pitch range ("if the dominant pitch is between A and B then it is part of the melody"), so that no training set is needed to define melody.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_and_Symbolic_Key&diff=7782005:Audio and Symbolic Key2005-02-09T10:56:21Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.<br />
<br />
<br />
==Potential Participants==<br />
<br />
'''Audio Key-Finding''':<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es): [high].<br />
* Steffen Pauws (steffen.pauws@philips.com): [high].<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Ozgur Izmirli (oizm@conncoll.edu): [moderate].<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg): [unknown].<br />
* Hendrik Purwins (hendrik@cs.tu-berlin.de): [unknown].<br />
<br />
<br />
'''Symbolic Key-Finding''':<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi): [high]. <br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi): [high]. <br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu): [high]. <br />
* Craig Sapp (craig@ccrma.stanford.edu): [moderate]. <br />
* David Temperley (dtemp@theory.esm.rochester.edu): [unknown].<br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. Once these settings are determined, an accuracy rate may be calculated. The input of the test should be some excerpt of the pieces in the test set, and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the only part of the piece for which establishment of the global, known key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that an algorithm that returns a key closely related to the actual key is superior to one that returns an unrelated key. We may then use this information to generate further metrics.<br />
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
'''Audio Data''': Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that only fifteen to thirty second excerpts may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of audio files may be copied without any violations of such regulations. This is advantageous since fifteen to thirty second excerpts will be well within this limit.<br />
<br />
<br />
'''MIDI Collections''': MIDI data are a symbolic representation of music. They provide a numeric representation of the pitch, onset/offset time and velocity for every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files by more than two thousand composers in MIDI format. All the files are presented with full title and composer. Also, most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to the various arrangements of instruments. <br />
<br />
<br />
'''Score-based Collections''': Score-based data are also symbolic representations of music. In addition to numeric event information, they also provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.<br />
<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known. <br />
<br />
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres. <br />
<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate to what reference the key is "tuned" (A = 440 or 442 Hz, ...). Keys should also be considered 'close' if they have a relationship of "1 semitone", to account for this difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
[Arpi 02.08.05]: Great idea!<br />
<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.<br />
<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
[Arpi 02.08.05]: Thank you. This has been corrected.<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
<br />
[Arpi 02.08.05]: We would like to receive further input in regards to this. We are open to using the entire piece or an excerpt (i.e. 15, 30 seconds).<br />
<br />
<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
<br />
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.<br />
<br />
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.<br />
<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
<br />
<br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.<br />
<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.<br />
<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed.<br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.<br />
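<br />
As a rough illustration of this retrieval framing, each key query (e.g. "find all pieces in D minor") could be scored with standard set-based precision and recall (a minimal sketch; the piece identifiers are invented):<br />
<pre>
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query such as 'all pieces in D minor'."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented piece identifiers: ground truth vs. what a system returned for the query.
ground_truth = {"piece_01", "piece_07", "piece_12"}
system_hits  = {"piece_01", "piece_07", "piece_30"}
p, r = precision_recall(system_hits, ground_truth)   # p = 2/3, r = 2/3
</pre>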
<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involve completely separate data and separate participants. From the committee point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.<br />
<br />
<br />
==Arpi's Comments==<br />
<br />
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Main_Page&diff=212005:Main Page2005-02-09T10:55:44Z<p>138.37.33.58: /* Topics */</p>
<hr />
<div>==Welcome to the MIREX Wiki.== <br />
<br />
* MIREX 2005 (2nd Annual Music Information Retrieval Evaluation eXchange): https://www.music-ir.org/evaluation/MIREX/index.html<br />
* Call for evaluation topic: https://www.music-ir.org/evaluation/MIREX/call_for_evaluation_topics.html<br />
* Call for test data and evaluation procedures: https://www.music-ir.org/evaluation/MIREX/call_for_data_and_procedures.html<br />
<br />
<br />
==Topics==<br />
<br />
* [[Audio Artist Identification]] <br />
* [[Audio Drum Detection]] <br />
* [[Audio Genre Classification]] <br />
* [[Audio Melody Extraction]] <br />
* [[Audio Onset Detection]] <br />
* [[Audio Tempo Extraction]] <br />
* [[Audio and Symbolic Key Finding]]<br />
* [[Symbolic Genre Classification]] <br />
* [[Symbolic Melodic Similarity]]<br />
<br />
==Editing Resources==<br />
<br />
Please see: <br />
<br />
* MediaWiki: [http://meta.wikipedia.org/wiki/MediaWiki_User%27s_Guide User's Guide]<br />
* MediaWiki: [http://www.wikipedia.org/wiki/Help:Editing Editing Help]<br />
<br />
<br />
==Other External Links==<br />
<br />
*M2K: https://music-ir.org/evaluation/m2k/index.html<br />
*M2K modules webpage: https://music-ir.org/evaluation/m2k/module_listing.html<br />
*M2K Modules Wiki: https://www.music-ir.org/modules<br />
*The Tools We Use: https://music-ir.org/evaluation/tools.html<br />
*IMIRSEL: https://music-ir.org/evaluation/<br />
*Music-IR Bibliography: https://music-ir.org/research_home.html<br />
*Music-IR.org: https://music-ir.org/</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Key_Finding&diff=3522005:Audio Key Finding2005-02-04T12:48:12Z<p>138.37.33.58: /* Emmanuel's Comments */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. Thus the same contest was also proposed for MIDI data. Algorithms that determine the key from audio should be robust enough to handle frequency interference and harmonic effects caused by the use of multiple instruments.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es): [high].<br />
* Steffen Pauws (steffen.pauws@philips.com): [high].<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Ozgur Izmirli (oizm@conncoll.edu): [moderate].<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. Once these settings are determined, an accuracy rate may be calculated. The input of the test should be some excerpt of the pieces in the test set and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpt will typically be the beginnings of the pieces as this is the only part of the piece for which establishing of the global and known key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that an algorithm that returns a key closely related to the actual key is performing better than one that returns an unrelated key. We may then use this information to generate further metrics.<br />
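<br />
A minimal sketch of how the closeness categories above could be detected follows (the numeric weights are placeholder assumptions, since the proposal only names the relationships):<br />
<pre>
# Placeholder weights; the proposal only names the relationships, not their scores.
CLOSE_SCORES = {"exact": 1.0, "fifth": 0.5, "relative": 0.3, "parallel": 0.2, "other": 0.0}

def key_relation(estimated, reference):
    """Keys are (tonic pitch class 0-11, mode), e.g. (0, 'major') for C major."""
    (et, em), (rt, rm) = estimated, reference
    if (et, em) == (rt, rm):
        return "exact"
    if em == rm and (et - rt) % 12 in (5, 7):   # tonics a perfect fifth apart
        return "fifth"
    if em != rm and et == rt:                   # parallel major/minor share the tonic
        return "parallel"
    if em != rm:
        minor_t, major_t = (et, rt) if em == "minor" else (rt, et)
        if (minor_t - major_t) % 12 == 9:       # relative pair, e.g. A minor and C major
            return "relative"
    return "other"

# Estimating A minor when the reference is C major counts as a 'relative' error:
score = CLOSE_SCORES[key_relation((9, "minor"), (0, "major"))]
</pre>
Per-piece scores of this kind could then feed the further metrics mentioned above.<br />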
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that only fifteen to thirty second excerpts may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of audio files may be copied without any violations of such regulations. This is advantageous since fifteen to thirty second excerpts will be well within this limit.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442, ...). Keys should also be considered 'close' if they have a relationship of one semitone, to account for the difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed.<br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involve completely separate data and separate participants. From the committee point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, since audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Key_Finding&diff=2652005:Symbolic Key Finding2005-02-04T12:40:11Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. Thus the same contest was also proposed for audio data.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi): [high].<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi): [high].<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Craig Sapp (craig@ccrma.stanford.edu): [moderate].<br />
* David Temperley (dtemp@theory.esm.rochester.edu): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. Once these settings are determined, an accuracy rate may be calculated. The input of the test should be some excerpt of the pieces in the test set and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpt will typically be the beginnings of the pieces as this is the only part of the piece for which establishing of the global and known key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that if an algorithm returns a key that is closely related to the actual key then it is superior. We may then use this information to generate further metrics.<br />
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
MIDI Collections: MIDI data are a symbolic representation of music. They provide a numeric representation of the pitch, onset/offset time and velocity for every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files by more than two thousand composers in MIDI format. All the files are presented with full name and composer. Also, most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to the various arrangements of instruments.<br />
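<br />
For reference, the event information described above can be extracted with an off-the-shelf MIDI parser; a minimal sketch using the Python mido library is shown below (the file name is hypothetical):<br />
<pre>
import mido  # third-party MIDI parser; any equivalent library would do

def note_onsets(path):
    """Return (onset time in seconds, MIDI pitch, velocity) for every sounding note."""
    events, now = [], 0.0
    for msg in mido.MidiFile(path):   # iteration merges tracks; msg.time is a delta in seconds
        now += msg.time
        # A note_on with velocity 0 is effectively a note_off, so it is skipped here.
        if msg.type == "note_on" and msg.velocity > 0:
            events.append((now, msg.note, msg.velocity))
    return events

# notes = note_onsets("some_piece.mid")   # hypothetical file name
</pre>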
<br />
Score-based Collections: Score-based data are also symbolic representations of music. In addition to numeric event information, they also provide further pitch and time structure information such as contextually correct note names, key signatures and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
==Review 2==<br />
<br />
General comment: Title - Evaluation of Key Finding Algorithms Using Symbolic Data or Evaluation of Key Finding Algorithms Part 2<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
The test data are relevant for the problem. For the classical archive web-site, permission would have to be obtained for large downloads at a go, else progressive downloads would have to be worked on (since only a certain number of performances is allowed for downloads in a day).<br />
<br />
==Downie's Comments==<br />
<br />
Please see my Audio Key Finding comments.<br />
<br />
==Emmanuel's Comments==<br />
<br />
Please see my Audio Key Finding comments.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Key_Finding&diff=3512005:Audio Key Finding2005-02-04T12:38:46Z<p>138.37.33.58: /* Emmanuel's Comments */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. Thus the same contest was also proposed for MIDI data. Algorithms that determine the key from audio should be robust enough to handle frequency interference and harmonic effects caused by the use of multiple instruments.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es): [high].<br />
* Steffen Pauws (steffen.pauws@philips.com): [high].<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Ozgur Izmirli (oizm@conncoll.edu): [moderate].<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. Once these settings are determined, an accuracy rate may be calculated. The input of the test should be some excerpt of the pieces in the test set and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpt will typically be the beginnings of the pieces as this is the only part of the piece for which establishing of the global and known key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that if an algorithm returns a key that is closely related to the actual key then it is superior. We may then use this information to generate further metrics.<br />
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that only fifteen to thirty second excerpts may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of audio files may be copied without any violations of such regulations. This is advantageous since fifteen to thirty second excerpts will be well within this limit.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442, ...). Keys should also be considered 'close' if they have a relationship of one semitone, to account for the difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed.<br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
==Emmanuel's Comments==<br />
<br />
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involved completely separate data and separate participants. From the committee point of view, this needed as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.<br />
<br />
Of course, I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for it ? Also the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work also on symbolic data, audio data may be easily synthesized from symbolic data using a conventional midi synthesizer.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Key_Finding&diff=3502005:Audio Key Finding2005-02-04T12:09:20Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representation. Thus the same contest was also proposed for MIDI data. Algorithms that determine the key from audio should be robust enough to handle frequency interference and harmonic effects caused by the use of multiple instruments.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es): [high].<br />
* Steffen Pauws (steffen.pauws@philips.com): [high].<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Ozgur Izmirli (oizm@conncoll.edu): [moderate].<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. Once these settings are determined, an accuracy rate may be calculated. The input of the test should be some excerpt of the pieces in the test set and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpt will typically be the beginnings of the pieces as this is the only part of the piece for which establishing of the global and known key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that if an algorithm returns a key that is closely related to the actual key then it is superior. We may then use this information to generate further metrics.<br />
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that only fifteen to thirty second excerpts may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of audio files may be copied without any violations of such regulations. This is advantageous since fifteen to thirty second excerpts will be well within this limit.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.<br />
<br />
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known.<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442, ...). Keys should also be considered 'close' if they have a relationship of one semitone, to account for the difference between the real key (according to its tuning) and the labelled key (A major). In the case of MIDI, this problem does not appear.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?<br />
<br />
==Downie's Comments==<br />
<br />
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed.<br />
<br />
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.<br />
<br />
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.<br />
<br />
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!<br />
<br />
==Emmanuel's Comments==</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Genre&diff=1872005:Audio Genre2005-02-02T14:18:13Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk<br />
<br />
<br />
==Title==<br />
<br />
Genre Classification from polyphonic audio.<br />
<br />
<br />
==Description==<br />
<br />
The automatic classification of polyphonic musical audio (in PCM format) into a single high-level genre per example. If there is sufficient demand, a multiple genre track could be defined, requiring submissions to identify each genre (without prior knowledge of the number of labels), with the precision and recall scores calculated for each result.<br />
<br />
1) Input data<br />
The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.<br />
<br />
Audio format:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* Either whole files or 1 minute excerpts<br />
<br />
Audio content:<br />
* polyphonic music<br />
* data set should include at least 8 different genres (Suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Ballroom Dance, Electronic/Modern Dance, Classical, Folk - to Exclude "World" music as this is a common "catch-all" for ethnic/folk music that is not easily classified into another group and can contain such diverse music as Indian tabla and Celtish rock)<br />
* the classification could also be evaluated at two levels (see the two-level scoring sketch after this list). For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Pop/Rock), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).<br />
* both live performances and sequenced music are eligible<br />
* Each class should be represented by a minimum of 100 examples, but 150 would be preferred. If possible the same number of examples should represent each class.<br />
* If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.<br />
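<br />
A minimal sketch of the two-level evaluation mentioned in the list above (the level II to level I mapping is a placeholder, since the final hierarchy is still to be decided):<br />
<pre>
# Placeholder level II -> level I mapping; the final hierarchy is still to be decided.
LEVEL_I = {"Rock": "Rock/Pop", "Pop": "Rock/Pop",
           "Chamber music": "Classical", "Orchestral music": "Classical",
           "Jazz": "Jazz/Blues", "Blues": "Jazz/Blues"}

def two_level_accuracy(predicted, actual):
    """predicted/actual are lists of level II labels; returns (level II, level I) accuracy."""
    fine = sum(p == t for p, t in zip(predicted, actual)) / len(actual)
    coarse = sum(LEVEL_I[p] == LEVEL_I[t] for p, t in zip(predicted, actual)) / len(actual)
    return fine, coarse

# Mistaking Blues for Jazz is wrong at level II but still correct at level I:
print(two_level_accuracy(["Jazz", "Rock"], ["Blues", "Rock"]))   # (0.5, 1.0)
</pre>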
<br />
Metadata:<br />
* By definition each example must have a genre label corresponding to one of the output classes.<br />
* Where possible, existing genre labels should be confirmed by two or more non-entrants; due to IP constraints it is unlikely that we will be allowed to distribute any database for metadata validation by participants.<br />
* The training set should be defined by a text file with one entry per line, in the following format:<br />
<example path and filename>\t<genre label>\n<br />
<br />
2) Output results<br />
Results should be output into a text file with one entry per line in the following format:<br />
<br />
<example path and filename>\t<genre classification>\n<br />
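<br />
Both exchange files are plain tab-separated text, so a minimal reader/writer sketch (the function names are illustrative, not part of the framework) is only a few lines:<br />
<pre>
def read_training_list(path):
    """Parse '<example path and filename>\t<genre label>' lines into a dict."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                filename, genre = line.rstrip("\n").split("\t")
                labels[filename] = genre
    return labels

def write_results(path, predictions):
    """Write '<example path and filename>\t<genre classification>' lines."""
    with open(path, "w", encoding="utf-8") as f:
        for filename, genre in predictions.items():
            f.write(f"{filename}\t{genre}\n")
</pre>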
<br />
<br />
==Potential Participants==<br />
<br />
* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, High<br />
* Elias Pampalk (ÖFAI), elias@oefai.at, High<br />
* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, High<br />
* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High<br />
* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, High<br />
* Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium<br />
* François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium<br />
<br />
<br />
==Evaluation Procedures==<br />
3 (or 5, time permitting) fold cross validation of all submissions using an equal proportion of each class for each fold.<br />
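<br />
A minimal sketch of building such class-balanced folds with a stratified splitter (using scikit-learn; the label array is a stand-in for the real ground truth):<br />
<pre>
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in ground truth: one genre label per example (300 examples, 100 per class).
labels = np.array(["Rock", "Jazz", "Classical"] * 100)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
    # Each fold preserves the class proportions of the full set in both partitions.
    train_labels, test_labels = labels[train_idx], labels[test_idx]
</pre>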
<br />
Evaluation measures:<br />
* Simple accuracy and standard deviation of results (in the event of uneven class sizes, these should be normalised according to class size).<br />
* Test the significance of differences in error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of p-values (a minimal sketch follows this list).<br />
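<br />
A minimal sketch of McNemar's test on the paired per-example outcomes of two submissions (using the chi-square approximation with continuity correction; the correctness vectors are invented):<br />
<pre>
from scipy.stats import chi2

def mcnemar_p(correct_a, correct_b):
    """McNemar's test on paired per-example outcomes (two lists of booleans)."""
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)  # A right, B wrong
    c = sum(1 for a, other in zip(correct_a, correct_b) if other and not a)  # B right, A wrong
    if b + c == 0:
        return 1.0                                # the two systems never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)        # chi-square with continuity correction
    return float(chi2.sf(stat, df=1))

# Invented per-example correctness vectors for two submissions on one fold:
p_value = mcnemar_p([True, True, False, True, False], [True, False, False, True, True])
</pre>
The resulting p-values could then be averaged across folds as proposed.<br />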
<br />
Evaluation framework:<br />
<br />
Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions both in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++ using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. Framework will define test and training set for each iteration of cross-validation, evaluate and rank results and perform McNemar's testing of differences between error-rates of each system. An example framework could be made available early February for submission development.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
Re-use Magnatune database (???)<br />
Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with creative commons)<br />
Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)<br />
Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites<br />
<br />
Ground truth annotations:<br />
<br />
All annotations should be validated, rather than accepting supplied genre labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, final classification being decided by a majority vote. Any particularly contentious classifications could be removed.<br />
<br />
<br />
==Review 1==<br />
<br />
The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples. Both tests follow on from successful evaluations held at ISMIR2004; there was good interest and interesting results, and I think we can expect good participation in 2005.<br />
<br />
The tasks are well-defined, easily understood, and appear to have some practical importance. The evaluation and testing procedures are very good. This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.<br />
<br />
My only comments relate to the choice of ground truth data. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution. If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders. (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)<br />
<br />
My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.<br />
<br />
In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results. In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete. Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat. <br />
<br />
Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions. However, having each site run their own classifiers is a huge win in terms of the logistics of running the test. I would, however, discourage any scheme which involved releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set if the test set labels are just lying around.<br />
<br />
I'm not sure how important the M2K/D2K angle is. It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation. By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding. <br />
<br />
In terms of the genre contest, the big issue is the unreliability and unclear definitions of the ground truth labels. It seems weird to have one evaluation on the ability to distinguish an arbitrary set of artists - a very general-sounding problem - and another contest which is specifically dominated by the ability to distinguish classical from jazz from rock - a very specific, and perhaps not very important, problem. <br />
<br />
Again in this case I don't particularly like the idea of trying to get multiple labellings: for artists, I thought it was unnecessary because agreement will be very high. Here, I think it's of dubious value because agreement will be so low; in both cases, errors in ground truth impact all participants equally, and so are not really a concern - we are mostly interested in relative values, so a ceiling on absolute performance due to a few 'incorrect' reference labels is of little consequence. <br />
<br />
Clearly, we can run a genre contest: I would again advocate for real music, and not worry too much about copyright issues, and not even worry too much about where the genre ground truth comes from, since it is always pretty suspect; allmusic.com is as good a source as any. But I personally find this contest of less intellectual interest than artist ID, even though it has historically received more attention, because of the poor definition of the true, underlying classes. <br />
<br />
I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms. It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).<br />
<br />
==Review 2==<br />
<br />
The single genre problem is well defined and seems to be a relevant problem for the MIR community nowadays. Obviously, it would be more relevant to classify each track into multiple genres or to use a hierarchy of genres, but the proposal does not deal with these issues in a satisfying way. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are nodes for Electronic and Jazz/Blues, where lies the leaf Electro-jazz?<br />
I suggest that the contest concentrate on the well-defined simple genre problem. An interesting development of it would be to ask algorithms to associate a percentage of probability with each predefined genre on each track, instead of outputting a single genre with 100% probability.<br />
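<br />
As a rough illustration, such probabilistic outputs could be scored by the average negative log-probability assigned to the correct genre (a minimal sketch; the genre list and probability rows are invented):<br />
<pre>
import math

def mean_log_loss(probability_rows, true_genres, genres):
    """Average negative log-probability assigned to the correct genre (lower is better)."""
    total = 0.0
    for probs, truth in zip(probability_rows, true_genres):
        p = max(probs[genres.index(truth)], 1e-12)   # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_genres)

genres = ["Rock", "Jazz", "Classical"]
# One probability row per track, summing to 1, instead of a single hard label:
rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
truth = ["Rock", "Classical"]
score = mean_log_loss(rows, truth, genres)
</pre>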
Regarding the input format, I think that whole files are better (the total duration and the volume variation are already good genre descriptors) and that polyphony is not required (classical music contains many works for solo instruments).<br />
<br />
I have no precise opinion regarding the defined genres, since this is more a matter of cultural importance. I'm not sure that Rock is less diverse than World (what's the common point between Elvis and Radiohead?). Also I am surprised that there is no Rap/RnB.<br />
The choice of the genre classes is a crucial issue for the contest to be held several times. Indeed existing databases can be reused only when the defined categories are identical each year. Thus I would like this choice to be more discussed by the participants.<br />
<br />
The list of participants is relevant. McKinney and Breebart could be added.<br />
<br />
It is a good idea to accept many programming languages for submission. However it seems quite difficult to implement the learning phase, because each algorithm may use different structures to store learnt data. For instance, when the algorithm computes descriptors and feeds them through a classifier, is it possible to select the best descriptors? If not, it is not realistic to suppose that the participant has to do it beforehand on his own limited set of data. Then I see two possibilities: either participants are given 50% of the database and do all the learning work themselves (then no k-fold cross-validation is performed), or submissions concern only sets of descriptors and not full classification algorithms. The second choice has the advantage of allowing different sets of descriptors to be compared with the same classifiers.<br />
<br />
The test data are relevant but still a bit vague. Obviously existing databases should be used again and completed with new annotated data. The participants should list their own databases in detail and put them in common for evaluation in order to evaluate the time needed to annotate new data.<br />
<br />
==Downie's Comment==<br />
<br />
1. Think genre tasks are kinda fun, actually. Devil is in the details. Would give my eye teeth to avoid manually labelling genre classes. You set up eight classes with 100-150 examples. That comes to 800-1200 labels that need applying. Can we as a group come up with a possible standardized source for genre labels and then, even though they are not perfect, live with our choice? Perhaps in these early days, we would be best served by looking at only the broadest of categories and not fussing about the fine-grained subdivisions?<br />
<br />
2. Would be interesting to have a TRUE genre task! As we learned in the UPF doctoral seminar prior to ISMIR 2004, genre is properly defined as the "use" of the music: dance, liturgical, funereal, etc. What we are calling genre here is really style. Just a thought.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Artist&diff=872005:Audio Artist2005-02-02T14:17:31Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk<br />
<br />
<br />
==Title==<br />
<br />
Artist or group identification from musical audio.<br />
<br />
<br />
==Description==<br />
<br />
The automatic artist identification of musical audio.<br />
<br />
1) Input data<br />
The input for this task is a set of sound file excerpts adhering to the format, meta data and content requirements mentioned below.<br />
<br />
Audio format:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* Either whole files or 1 minute excerpts<br />
<br />
Audio content:<br />
* Any type of music<br />
* data set should include at least 25 different artists or groups working in any genre<br />
* both live performances and sequenced music are eligible<br />
* Each artist should be represented by a minimum of 10 examples. If possible the same number of examples should represent each artist.<br />
* If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.<br />
* Would be good to enforce some sort of cross-album component for the actual contest to avoid producer detection<br />
<br />
Metadata:<br />
* By definition each example must have an artist or group label corresponding to one of the output classes.<br />
* It is assumed that artist labels will be correct; however, where possible, existing artist labels should be confirmed by two or more non-entrants, since due to IP constraints it is unlikely that we will be allowed to distribute any database for metadata validation by participants. This validation should ensure that each artist or group has a single label which is applied to all of their examples and that any conflicts, such as an artist also belonging to a group that is also represented within the data, are resolved/removed for simplicity. Other possibilities include allowing multiple artist labels, and requiring submissions to identify each label, with the final score divided evenly among the labels (I doubt there is demand for this).<br />
* The training set should be defined by a text file with one entry per line, in the following format:<br />
<example path and filename>\t<artist label>\n<br />
<br />
2) Output results<br />
Results should be output into a text file with one entry per line in the following format:<br />
<example path and filename>\t<artist classification>\n<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, Medium<br />
* Elias Pampalk (ÖFAI), elias@oefai.at, Medium<br />
* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, Medium<br />
* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High<br />
* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, Medium<br />
* Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium<br />
* François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium<br />
<br />
<br />
==Evaluation Procedures==<br />
3 (or 5, time permitting) fold cross validation of all submissions using an equal proportion of each class for each fold.<br />
<br />
Evaluation measures:<br />
* Simple accuracy and standard deviation of results (in the event of uneven class sizes, these should be normalised according to class size).<br />
* Test the significance of differences in error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of p-values.<br />
* Perhaps specify different class #s (1-in-10, 1-in-50, 1-in-1000) to test scaling and robustness among different implementations<br />
<br />
Evaluation framework:<br />
<br />
Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions both in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++ using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. Framework will define test and training set for each iteration of cross-validation, evaluate and rank results and perform McNemar's testing of differences between error-rates of each system. An example framework could be made available early February for submission development.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
(Note potentially significant data overlap between this task and genre classification competition)<br />
Re-use Magnatune database (???)<br />
Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with creative commons)<br />
Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)<br />
Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites<br />
<br />
Ground truth annotations:<br />
<br />
All annotations should be validated, to ensure homogeneity of artist labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification decided by a majority vote. Any particularly contentious classifications could be removed.<br />
<br />
<br />
==Review 1==<br />
<br />
The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples. Both tests follow on from successful evaluations held at ISMIR2004; there was good interest and interesting results, and I think we can expect good participation in 2005.<br />
<br />
The tasks are well-defined, easily understood, and appear to have some practical importance. The evaluation and testing procedures are very good. This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.<br />
<br />
My only comments relate to the choice of ground truth data. For the artist ID task I think we should use real, commercial recordings, since there is no shortage of them, and the artist ground truth is easily defined. I do not think it is important to have independent verification of the ground truth, since there will be enough examples to ensure that a few questionable cases won't much hurt the overall performance, and in any case all we really care about is comparative performance. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution. If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders. (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)<br />
<br />
My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.<br />
<br />
In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results. In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete. Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat. <br />
<br />
Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions. However, having each site run their own classifiers is a huge win in terms of the logistics of running the test. I would, however, discourage any scheme which involved releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set if the test set labels are just lying around.<br />
<br />
I am particularly interested in a set of contrast conditions for different scales of problem - 1 in 10, 1 in 50, 1 in 100 etc. Most artist ID tasks have been on very small subsets of 'all possible artists', and it would be interesting to see if there are differences in how different approaches scale (e.g. that only some techniques are tractable for very large sets).<br />
<br />
I also think that a cross-album condition is particularly interesting. Again, this could be a contrast: for each artist, have training data from albums A and B, then have (disjoint) test data from albums B and C, and compare the accuracy on both cases to see how strong the 'producer effect' (or within-album similarity) really is. <br />
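<br />
A rough sketch (in Python; the track fields are hypothetical) of how such a contrast could be constructed: training data from album A and part of album B, a within-album test set from the remaining tracks of B, and a cross-album test set from album C:<br />
<pre>
from collections import defaultdict

def cross_album_contrast(tracks):
    # tracks: list of dicts with 'artist', 'album' and 'path' keys (hypothetical).
    by_artist = defaultdict(lambda: defaultdict(list))
    for t in tracks:
        by_artist[t['artist']][t['album']].append(t)
    train, test_within, test_cross = [], [], []
    for artist, albums in by_artist.items():
        if len(albums) < 3:
            continue  # need at least three albums per artist for this contrast
        a, b, c = sorted(albums)[:3]
        train += albums[a]
        half = len(albums[b]) // 2
        train += albums[b][:half]        # part of album B used for training
        test_within += albums[b][half:]  # disjoint tracks from B (within-album condition)
        test_cross += albums[c]          # album C is never seen in training
    return train, test_within, test_cross
</pre>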
<br />
I'm not sure how important the M2K/D2K angle is. It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation. By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding. <br />
<br />
I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms. It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).<br />
<br />
==Review 2==<br />
<br />
This proposal is very interesting and it is one of the most well defined. Indeed it seems quite straightforward to establish the ground truth and to evaluate the results.<br />
<br />
The mentioned participants really belong to the field. People working on voice separation could be added, such as Feng, Zhuang & Pan and Tsai & Wang.<br />
<br />
The test data are also relevant and seem easy to obtain. The RWC database could also provide some data. However I don't think that data synthesized from MIDI can be used (to avoid the "MIDI-producer" detection).<br />
<br />
My main concern is about the range of genres spanned by the data. Indeed, if most data come from different genres, the problem becomes far easier and less relevant. I believe that artist identification and artist similarity (which is close to genre classification) are very different queries, and that artist identification is relevant only within a given genre.<br />
Thus I would like to perform the evaluation on one or two sets of artists belonging to a single genre (say classical or rock) and containing some very similar artists (say Mozart/Haydn/Gluck or The Beatles/The Rolling Stones/The Who).<br />
<br />
==Downie's Comments==<br />
<br />
Review #2 does raise the interesting point of too much spread in the "genre" aspect. I do see how it could turn into a genre task if not thought out. Would be interesting to also add in the idea of "covers": same pieces but performed by different artists. Maybe, if possible, a mix of "live" and "studio" recordings of same pieces if available?<br />
<br />
Some questions:<br />
<br />
1. Why PCM? Why mono? Why not MP3? Am being a bit of a weeny, but I am interested.<br />
<br />
2. Do we *really* need to supply the training set? Being both provocative and pragmatic with this question.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Genre&diff=1842005:Audio Genre2005-02-01T20:46:32Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk<br />
<br />
<br />
==Title==<br />
<br />
Genre Classification from polyphonic audio.<br />
<br />
<br />
==Description==<br />
<br />
The automatic classification of polyphonic musical audio (in PCM format) into a single high-level genre per example. If there is sufficient demand, a multiple genre track could be defined, requiring submissions to identify each genre (without prior knowledge of the number of labels), with the precision and recall scores calculated for each result.<br />
<br />
1) Input data<br />
The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.<br />
<br />
Audio format:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* Either whole files or 1 minute excerpts<br />
<br />
Audio content:<br />
* polyphonic music<br />
* data set should include at least 8 different genres (Suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Ballroom Dance, Electronic/Modern Dance, Classical, Folk - to Exclude "World" music as this is a common "catch-all" for ethnic/folk music that is not easily classified into another group and can contain such diverse music as Indian tabla and Celtish rock)<br />
* the classification could also be evaluated in two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Pop/Rock), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).<br />
* both live performances and sequenced music are eligible<br />
* Each class should be represented by a minimum of 100 examples, but 150 would be preferred. If possible the same number of examples should represent each class.<br />
* If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.<br />
<br />
Metadata:<br />
* By definition each example must have a genre label corresponding to one of the output classes.<br />
* Where possible existing genre labels should be confirmed by two or more non-entrants; due to IP constraints it is unlikely that we will be allowed to distribute any database for metadata validation by participants.<br />
* The training set should be defined by a text file with one entry per line, in the following format:<br />
<example path and filename>\t<genre label>\n<br />
<br />
2) Output results<br />
Results should be output into a text file with one entry per line in the following format:<br />
<br />
<example path and filename>\t<genre classification>\n<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, High<br />
* Elias Pampalk (ÖFAI), elias@oefai.at, High<br />
* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, High<br />
* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High<br />
* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, High<br />
* Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium<br />
* François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium<br />
<br />
<br />
==Evaluation Procedures==<br />
3 (or 5, time permitting) fold cross validation of all submissions using an equal proportion of each class for each fold.<br />
<br />
Evaluation measures:<br />
* Simple accuracy and standard deviation of results (in the event of uneven class sizes, both should be normalised according to class size).<br />
* Test the significance of differences in the error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of p-values.<br />
<br />
Evaluation framework:<br />
<br />
Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions either in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005) or in Matlab, Python and C++ using the external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results and perform McNemar's testing of differences between the error rates of each system. An example framework could be made available early February for submission development.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
Re-use Magnatune database (???)<br />
Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with creative commons)<br />
Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)<br />
Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites<br />
<br />
Ground truth annotations:<br />
<br />
All annotations should be validated, rather than accepting supplied genre labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification decided by a majority vote. Any particularly contentious classifications could be removed.<br />
<br />
==Review 1==<br />
<br />
<br />
==Review 2==<br />
<br />
The single genre problem is well defined and seems to be a relevant problem for the MIR community nowadays. Obviously, it would be more relevant to classify each track into multiple genres or to use a hierarchy of genres, but the proposal does not deal with these issues in a satisfying way. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are nodes for Electronic and Jazz/Blues, where does the leaf Electro-jazz lie?<br />
I suggest that the contest concentrate on the well-defined single genre problem. An interesting development would be to ask algorithms to associate a probability with each predefined genre on each track, instead of outputting a single genre with 100% probability.<br />
Regarding the input format, I think that whole files are better (the total duration and the volume variation are already good genre descriptors) and that polyphony is not required (classical music contains many works for solo instruments).<br />
<br />
I have no precise opinion regarding the defined genres, since this is more of a cultural question. I'm not sure that Rock is less diverse than World (what's the common point between Elvis and Radiohead?). Also I am surprised that there is no Rap/RnB.<br />
The choice of the genre classes is a crucial issue for the contest to be held several times. Indeed existing databases can be reused only when the defined categories are identical each year. Thus I would like this choice to be more discussed by the participants.<br />
<br />
The list of participants is relevant. McKinney and Breebart could be added.<br />
<br />
It is a good idea to accept many programming languages for submission. However it seems quite difficult to implement the learning phase, because each algorithm may use different structures to store learnt data. For instance, when the algorithm computes descriptors and feeds them through a classifier, is it possible to select the best descriptors? If not, it is not realistic to suppose that the participant has to do it beforehand on his own limited set of data. Then I see two possibilities: either participants are given 50% of the database and do all the learning work themselves (then no k-fold cross validation is performed), or submissions concern only sets of descriptors and not full classification algorithms. The second choice has the advantage of allowing different sets of descriptors to be compared with the same classifiers.<br />
<br />
The test data are relevant but still a bit vague. Obviously existing databases should be used again and completed with new annotated data. The participants should list their own databases in detail and put them in common for evaluation in order to evaluate the time needed to annotate new data.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Melodic&diff=6272005:Symbolic Melodic2005-02-01T20:45:57Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Rainer Typke (Universiteit Utrecht) rtypke@cs.uu.nl<br />
<br />
<br />
==Title==<br />
<br />
Retrieving melodically similar incipits from 500000 notated real world compositions<br />
<br />
<br />
==Description==<br />
<br />
Retrieving the most similar incipits from the RISM A/II collection, given one of the incipits as a query.<br />
<br />
Expected results: For 11 queries, a ground truth has been established (see Typke et al. 2005, http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/TR_groundtruth.pdf ). This ground truth makes it possible to measure the search results in an absolute way, taking into consideration not only relevance, but also the correct ranking.<br />
This ground truth has been obtained by combining ranked lists that were created by 35 music experts. The resulting ground truth has the form of ranked groups of incipits; the groups contain incipits whose differences in rankings were not statistically significant, but the ranking of the groups is statistically significant. MIR systems that perform well should return incipits such that the members of the groups are retrieved without violating the order of the groups.<br />
By using this ground truth, an absolute measure of quality is attainable. However, there is a danger of overfitting algorithms to just the 11 queries for which a ground truth is known. Therefore, there should also be an evaluation based on another set of queries that are chosen only after the algorithms have been submitted. This second part of the evaluation would only allow a comparison, not an absolute measurement, but any dependence of the algorithms on the queries could be avoided.<br />
<br />
<br />
==Potential Participants==<br />
<br />
Tuomas Eerola/Petri Toiviainen Jyväskylä, Finland<br />
Jürgen Kilian/Holger Hoos Darmstadt/British Columbia, Canada<br />
Shyamala Doraisamy/Stefan Rueger Malaysia/London<br />
Maarten Grachten/Josep Lluis Arcos/Ramón López de Mántaras Bellaterra,<br />
Spain<br />
Giovanna Neve/Nicola Orio Padova, Italy<br />
Anna Pienimäki/Kjell Lemström Helsinki, Finland<br />
Craig Sapp/Yi Wen Liu/Eleanor Selfridge-Field Stanford, US<br />
Daniel Müllensiefen/Klaus Frieler Hamburg<br />
Anna Lubiw/Luke Tanur Waterloo, Canada<br />
Michael Clausen Bonn<br />
Alexandra Uitdenbogerd, RMIT<br />
Rainer Typke, Frans Wiering, Remco C. Veltkamp, Universiteit Utrecht,<br />
{rainer.typke|frans.wiering|remcov}@cs.uu.nl<br />
Jeremy Pickens - jeremy@dcs.kcl.ac.uk<br />
Tim Crawford - t.crawford@gold.ac.uk<br />
<br />
The likelihood of entering is unknown for all groups except Utrecht and Goldsmiths/King's; for Utrecht and Goldsmiths/King's, it is high. The other groups have developed algorithms that would be interesting to compare, so their likelihood of entering is not too low.<br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
1) Absolute measurement using the ground truth<br />
For each of the 11 queries from the ground truth, we have a certain number of groups of incipits, where we know the correct ranking of the groups. Let us assume that the total number of incipits in these groups is N. For each ranking returned by a method to be evaluated, take the top N incipits. At every border between groups, calculate the precision. To get one single measure, one can now integrate over this precision curve and divide the result by N. This measure will be a number between 0 and 1, where 0 means total failure and 1 means complete agreement with the ground truth. This measure will reflect not only the ability of algorithms to find as many relevant incipits as possible, but also to order them correctly.<br />
<br />
For example, let us assume that the ground truth has delivered the following three groups, with N=6:<br />
(a,b),(c,d,e),(f).<br />
<br />
Method X returns:<br />
b,a, g,c,d, e.<br />
<br />
Since there are 3 groups, we determine precision and recall at three points:<br />
<br />
after position 2: precision 1 (because both relevant documents a and b have been found. The correct order of a and b is not known since they are in the same group, therefore no penalty should be applied for them being in the opposite order).<br />
<br />
after position 5: precision 4/5=0.8 (because only g should not be among the first 5 incipits)<br />
<br />
after position 6: precision 5/6=0.83 (because only g should not be among the first 6)<br />
<br />
For an illustration, see the last 3 slides of http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/dir05pres.pdf<br />
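<br />
A small sketch (in Python) of the per-boundary precisions, reproducing the worked example above; the final single-number summary shown is only one reading of the "integrate and divide by N" step, which the proposal leaves open:<br />
<pre>
def boundary_precisions(groups, ranking):
    # groups: ranked list of sets of relevant incipit ids, e.g. [{'a','b'}, {'c','d','e'}, {'f'}]
    # ranking: the system's ranked list of incipit ids
    relevant, cutoff, precisions = set(), 0, []
    for group in groups:
        relevant |= group
        cutoff += len(group)
        hits = sum(1 for item in ranking[:cutoff] if item in relevant)
        precisions.append(hits / cutoff)
    return precisions

p = boundary_precisions([{'a', 'b'}, {'c', 'd', 'e'}, {'f'}],
                        ['b', 'a', 'g', 'c', 'd', 'e'])
print(p)                 # [1.0, 0.8, 0.8333...] as in the example above
print(sum(p) / len(p))   # one possible single-number summary (an assumption)
</pre>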
<br />
2) Comparison of algorithms without using the ground truth<br />
To avoid the overfitting of the algorithms to the ground truth, select another group of queries after all algorithms have been submitted. Put all incipits returned by the algorithms into one pool. Remove doubles from the pool, divide the remaining incipits into portions of equal sizes, and distribute them back to the participants for relevance judgements (on a binary or ternary scale). Once the relevance is known for all returned incipits, use standard TREC measures for comparing the algorithms. This second comparison has some disadvantages: the complete set of relevant incipits won't be known, and neither will be the correct ranking. But the main disadvantage of the first comparison, the fact that the correct answer is known in advance, is avoided.<br />
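<br />
A brief sketch (in Python; purely illustrative) of the pooling step described above: combine all returned incipits, remove duplicates, and split the pool into roughly equal portions for relevance judging:<br />
<pre>
import random

def build_judging_portions(results_by_participant, num_portions, seed=0):
    pool = set()
    for ranking in results_by_participant.values():
        pool.update(ranking)      # pool all returned incipits; the set removes duplicates
    items = sorted(pool)
    random.Random(seed).shuffle(items)
    return [items[i::num_portions] for i in range(num_portions)]
</pre>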
<br />
<br />
==Relevant Test Collections==<br />
<br />
RISM A/II collection (large collection of real-world compositions for which a ground truth has already been established).<br />
Possible copyright issue: Permission from RISM for distributing the test collection or parts of it to participants and using it for EvalFest is probably necessary.<br />
Obtaining the data is not difficult: the RISM CD is available in good libraries, and the data can be exported from the CD in plaine&easie format. Alternatively, the existing result of converting the data into MusicXML could be distributed.<br />
If RISM does not agree to the data being distributed, maybe it is possible to convince them to at least agree to a solution similar to the Naxos audio collection, where the data are quarantined and only the software travels.<br />
<br />
==Review 1==<br />
<br />
Melodic similarity would have to be precisely defined, i.e. how was the ground truth established. <br />
<br />
The mentioned possible participants are really likely to participate (to my knowledge at least 10 out of the 14 listed).<br />
<br />
Part 1 - Absolute measurement using ground truth<br />
Can the number of queries be increased? 11 queries for a collection of 500000 incipits seem rather small (this comment can be ignored if there is difficulty in establishing the ground truth from such a large collection).<br />
Part 2 - Comparison of algorithms without using the ground truth <br />
Any ideas of how queries would be obtained? Real-world queries? Queries pooled from each participant?<br />
<br />
The relevance assumptions have to be explicitly stated for the judging process by the various participants using this pooling approach.<br />
If ternary scale is used, details of this scale would be needed.<br />
What evaluation measure would be adopted from TREC?<br />
<br />
==Review 2==<br />
<br />
A very well-thought-out proposal. First off, he has a database in mind, a large list of participants, and has thought carefully about evaluation. However, the number of likely participants is probably not as high as he believes and there are still issues with evaluation.<br />
<br />
I'm not totally clear about some aspects. These 500000 notated real world compositions - are they audio files or MIDI? How long are they, and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about MIDI.<br />
<br />
His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.<br />
<br />
His method of evaluation through use of a groundtruth established by music experts should be compared with Jeremy's suggestion (check previous MIREX postings or just communicate with him directly) of using variations to identify melodic similarity.<br />
<br />
His section on evaluation is quite good, and precision/recall is the preferred method in the Info Retrieval community (just ask Stephen). Both algorithms seem feasible assuming the other issues have been worked out. If possible, both his suggested evaluation procedures should be implemented. Would be interesting to see if they give similar results, and if not, why?<br />
If the RISM A/II collection is really available from many libraries, then most of the work is done. Stephen's group can keep it quarantined and just run the algorithms that are provided. I think we don't even need to ask RISM for permission.<br />
<br />
I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Melodic&diff=6262005:Symbolic Melodic2005-02-01T20:45:49Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Rainer Typke (Universiteit Utrecht) rtypke@cs.uu.nl<br />
<br />
<br />
==Title==<br />
<br />
Retrieving melodically similar incipits from 500000 notated real world compositions<br />
<br />
<br />
==Description==<br />
<br />
Retrieving the most similar incipits from the RISM A/II collection, given one of the incipits as a query.<br />
<br />
Expected results: For 11 queries, a ground truth has been established (see Typke et al. 2005, http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/TR_groundtruth.pdf ). This ground truth makes it possible to measure the search results in an absolute way, taking into consideration not only relevance, but also the correct ranking.<br />
This ground truth has been obtained by combining ranked lists that were created by 35 music experts. The resulting ground truth has the form of ranked groups of incipits; the groups contain incipits whose differences in rankings were not statistically significant, but the ranking of the groups is statistically significant. MIR systems that perform well should return incipits such that the members of the groups are retrieved without violating the order of the groups.<br />
By using this ground truth, an absolute measure of quality is attainable. However, there is a danger of overfitting algorithms to just the 11 queries for which a ground truth is known. Therefore, there should also be an evaluation based on another set of queries that are chosen only after the algorithms have been submitted. This second part of the evaluation would only allow a comparison, not an absolute measurement, but any dependence of the algorithms on the queries could be avoided.<br />
<br />
<br />
==Potential Participants==<br />
<br />
Tuomas Eerola/Petri Toiviainen Jyväskylä, Finland<br />
Jürgen Kilian/Holger Hoos Darmstadt/British Columbia, Canada<br />
Shyamala Doraisamy/Stefan Rueger Malaysia/London<br />
Maarten Grachten/Josep Lluis Arcos/Ramón López de Mántaras Bellaterra,<br />
Spain<br />
Giovanna Neve/Nicola Orio Padova, Italy<br />
Anna Pienimäki/Kjell Lemström Helsinki, Finland<br />
Craig Sapp/Yi Wen Liu/Eleanor Selfridge-Field Stanford, US<br />
Daniel Müllensiefen/Klaus Frieler Hamburg<br />
Anna Lubiw/Luke Tanur Waterloo, Canada<br />
Michael Clausen Bonn<br />
Alexandra Uitdenbogerd, RMIT<br />
Rainer Typke, Frans Wiering, Remco C. Veltkamp, Universiteit Utrecht,<br />
{rainer.typke|frans.wiering|remcov}@cs.uu.nl<br />
Jeremy Pickens - jeremy@dcs.kcl.ac.uk<br />
Tim Crawford - t.crawford@gold.ac.uk<br />
<br />
The likelihood of entering is unknown for all groups except Utrecht and Goldsmiths/King's; for Utrecht and Goldsmiths/King's, it is high. The other groups have developed algorithms that would be interesting to compare, so their likelihood of entering is not too low.<br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
1) Absolute measurement using the ground truth<br />
For each of the 11 queries from the ground truth, we have a certain number of groups of incipits, where we know the correct ranking of the groups. Let us assume that the total number of incipits in these groups is N. For each ranking returned by a method to be evaluated, take the top N incipits. At every border between groups, calculate the precision. To get one single measure, one can now integrate over this precision curve and divide the result by N. This measure will be a number between 0 and 1, where 0 means total failure and 1 means complete agreement with the ground truth. This measure will reflect not only the ability of algorithms to find as many relevant incipits as possible, but also to order them correctly.<br />
<br />
For example, let us assume that the ground truth has delivered the following three groups, with N=6:<br />
(a,b),(c,d,e),(f).<br />
<br />
Method X returns:<br />
b,a, g,c,d, e.<br />
<br />
Since there are 3 groups, we determine precision and recall at three points:<br />
<br />
after position 2: precision 1 (because both relevant documents a and b have been found. The correct order of a and b is not known since they are in the same group, therefore no penalty should be applied for them being in the opposite order).<br />
<br />
after position 5: precision 4/5=0.8 (because only g should not be among the first 5 incipits)<br />
<br />
after position 6: precision 5/6=0.83 (because only g should not be among the first 6)<br />
<br />
For an illustration, see the last 3 slides of http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/dir05pres.pdf<br />
<br />
2) Comparison of algorithms without using the ground truth<br />
To avoid the overfitting of the algorithms to the ground truth, select another group of queries after all algorithms have been submitted. Put all incipits returned by the algorithms into one pool. Remove doubles from the pool, divide the remaining incipits into portions of equal sizes, and distribute them back to the participants for relevance judgements (on a binary or ternary scale). Once the relevance is known for all returned incipits, use standard TREC measures for comparing the algorithms. This second comparison has some disadvantages: the complete set of relevant incipits won't be known, and neither will be the correct ranking. But the main disadvantage of the first comparison, the fact that the correct answer is known in advance, is avoided.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
RISM A/II collection (large collection of real-world compositions for which a ground truth has already been established).<br />
Possible copyright issue: Permission from RISM for distributing the test collection or parts of it to participants and using it for EvalFest is probably necessary.<br />
Obtaining the data is not difficult: the RISM CD is available in good libraries, and the data can be exported from the CD in plaine&easie format. Alternatively, the existing result of converting the data into MusicXML could be distributed.<br />
If RISM does not agree to the data being distributed, maybe it is possible to convince them to at least agree to a solution similar to the Naxos audio collection, where the data are quarantined and only the software travels.<br />
<br />
==Review 1==<br />
<br />
Melodic similarity would have to be precisely defined, i.e. how was the ground truth established. <br />
<br />
The mentioned possible participants are really likely to participate (to my knowledge at least 10 out of the 14 listed).<br />
<br />
Part 1 - Absolute measurement using ground truth<br />
Can the number of queries be increased? 11 queries for a collection of 500000 incipits seem rather small (this comment can be ignored if there is difficulty in establishing the ground truth from such a large collection).<br />
Part 2 - Comparison of algorithms without using the ground truth <br />
Any ideas of how queries would be obtained? Real-world queries? Queries pooled from each participant?<br />
<br />
The relevance assumptions have to be explicitly stated for the judging process by the various participants using this pooling approach.<br />
If ternary scale is used, details of this scale would be needed.<br />
What evaluation measure would be adopted from TREC?<br />
<br />
==Review 2==<br />
<br />
A very well-thought-out proposal. First off, he has a database in mind, a large list of participants, and has thought carefully about evaluation. However, the number of likely participants is probably not as high as he believes and there are still issues with evaluation.<br />
<br />
I'm not totally clear about some aspects. These 500000 notated real world compositions - are they audio files or MIDI? How long are they, and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about MIDI.<br />
<br />
His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.<br />
<br />
His method of evaluation through use of a groundtruth established by music experts should be compared with Jeremy's suggestion (check previous MIREX postings or just communicate with him directly) of using variations to identify melodic similarity.<br />
<br />
His section on evaluation is quite good, and precision/recall is the preferred method in the Info Retrieval community (just ask Stephen). Both algorithms seem feasible assuming the other issues have been worked out. If possible, both his suggested evaluation procedures should be implemented. Would be interesting to see if they give similar results, and if not, why?<br />
If the RISM A/II collection is really available from many libraries, then most of the work is done. Stephen's group can keep it quarantined and just run the algorithms that are provided. I think we don't even need to ask RISM for permission.<br />
<br />
I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Melodic&diff=6252005:Symbolic Melodic2005-02-01T20:45:30Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Rainer Typke (Universiteit Utrecht) rtypke@cs.uu.nl<br />
<br />
<br />
==Title==<br />
<br />
Retrieving melodically similar incipits from 500000 notated real world compositions<br />
<br />
<br />
==Description==<br />
<br />
Retrieving the most similar incipits from the RISM A/II collection, given one of the incipits as a query.<br />
<br />
Expected results: For 11 queries, a ground truth has been established (see Typke et al. 2005, http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/TR_groundtruth.pdf ). This ground truth makes it possible to measure the search results in an absolute way, taking into consideration not only relevance, but also the correct ranking.<br />
This ground truth has been obtained by combining ranked lists that were created by 35 music experts. The resulting ground truth has the form of ranked groups of incipits; the groups contain incipits whose differences in rankings were not statistically significant, but the ranking of the groups is statistically significant. MIR systems that perform well should return incipits such that the members of the groups are retrieved without violating the order of the groups.<br />
By using this ground truth, an absolute measure of quality is attainable. However, there is a danger of overfitting algorithms to just the 11 queries for which a ground truth is known. Therefore, there should also be an evaluation based on another set of queries that are chosen only after the algorithms have been submitted. This second part of the evaluation would only allow a comparison, not an absolute measurement, but any dependence of the algorithms on the queries could be avoided.<br />
<br />
<br />
==Potential Participants==<br />
<br />
Tuomas Eerola/Petri Toiviainen Jyväskylä, Finland<br />
Jürgen Kilian/Holger Hoos Darmstadt/British Columbia, Canada<br />
Shyamala Doraisamy/Stefan Rueger Malaysia/London<br />
Maarten Grachten/Josep Lluis Arcos/Ramón López de Mántaras Bellaterra,<br />
Spain<br />
Giovanna Neve/Nicola Orio Padova, Italy<br />
Anna Pienimäki/Kjell Lemström Helsinki, Finland<br />
Craig Sapp/Yi Wen Liu/Eleanor Selfridge-Field Stanford, US<br />
Daniel Müllensiefen/Klaus Frieler Hamburg<br />
Anna Lubiw/Luke Tanur Waterloo, Canada<br />
Michael Clausen Bonn<br />
Alexandra Uitdenbogerd, RMIT<br />
Rainer Typke, Frans Wiering, Remco C. Veltkamp, Universiteit Utrecht,<br />
{rainer.typke|frans.wiering|remcov}@cs.uu.nl<br />
Jeremy Pickens - jeremy@dcs.kcl.ac.uk<br />
Tim Crawford - t.crawford@gold.ac.uk<br />
<br />
The likelihood of entering is unknown for all groups except Utrecht and Goldsmiths/King's; for Utrecht and Goldsmiths/King's, it is high. The other groups have developed algorithms that would be interesting to compare, so their likelihood of entering is not too low.<br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
1) Absolute measurement using the ground truth<br />
For each of the 11 queries from the ground truth, we have a certain number of groups of incipits, where we know the correct ranking of the groups. Let us assume that the total number of incipits in these groups is N. For each ranking returned by a method to be evaluated, take the top N incipits. At every border between groups, calculate the precision. To get one single measure, one can now integrate over this precision curve and divide the result by N. This measure will be a number between 0 and 1, where 0 means total failure and 1 means complete agreement with the ground truth. This measure will reflect not only the ability of algorithms to find as many relevant incipits as possible, but also to order them correctly.<br />
<br />
For example, let us assume that the ground truth has delivered the following three groups, with N=6:<br />
(a,b),(c,d,e),(f).<br />
<br />
Method X returns:<br />
b,a, g,c,d, e.<br />
<br />
Since there are 3 groups, we determine precision and recall at three points:<br />
<br />
after position 2: precision 1 (because both relevant documents a and b have been found. The correct order of a and b is not known since they are in the same group, therefore no penalty should be applied for them being in the opposite order).<br />
<br />
after position 5: precision 4/5=0.8 (because only g should not be among the first 5 incipits)<br />
<br />
after position 6: precision 5/6=0.83 (because only g should not be among the first 6)<br />
<br />
For an illustration, see the last 3 slides of http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/dir05pres.pdf<br />
<br />
2) Comparison of algorithms without using the ground truth<br />
To avoid the overfitting of the algorithms to the ground truth, select another group of queries after all algorithms have been submitted. Put all incipits returned by the algorithms into one pool. Remove doubles from the pool, divide the remaining incipits into portions of equal sizes, and distribute them back to the participants for relevance judgements (on a binary or ternary scale). Once the relevance is known for all returned incipits, use standard TREC measures for comparing the algorithms. This second comparison has some disadvantages: the complete set of relevant incipits won't be known, and neither will be the correct ranking. But the main disadvantage of the first comparison, the fact that the correct answer is known in advance, is avoided.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
RISM A/II collection (large collection of real-world compositions for which a ground truth has already been established).<br />
Possible copyright issue: Permission from RISM for distributing the test collection or parts of it to participants and using it for EvalFest is probably necessary.<br />
Obtaining the data is not difficult: the RISM CD is available in good libraries, and the data can be exported from the CD in plaine&easie format. Alternatively, the existing result of converting the data into MusicXML could be distributed.<br />
If RISM does not agree to the data being distributed, maybe it is possible to convince them to at least agree to a solution similar to the Naxos audio collection, where the data are quarantined and only the software travels.<br />
<br />
==Review 1==<br />
<br />
==Review 2==<br />
<br />
A very well-thought-out proposal. First off, he has a database in mind, a large list of participants, and has thought carefully about evaluation. However, the number of likely participants is probably not as high as he believes and there are still issues with evaluation.<br />
<br />
I'm not totally clear about some aspects. These 500000 notated real world compositions - are they audio files or MIDI? How long are they, and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about MIDI.<br />
<br />
His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.<br />
<br />
His method of evaluation through use of a groundtruth established by music experts should be compared with Jeremy's suggestion (check previous MIREX postings or just communicate with him directly) of using variations to identify melodic similarity.<br />
<br />
His section on evaluation is quite good, and precision/recall is the preferred method in the Info Retrieval community (just ask Stephen). Both algorithms seem feasible assuming the other issues have been worked out. If possible, both his suggested evaluation procedures should be implemented. Would be interesting to see if they give similar results, and if not, why?<br />
If the RISM A/II collection is really available from many libraries, then most of the work is done. Stephen's group can keep it quarantined and just run the algorithms that are provided. I think we don't even need to ask RISM for permission.<br />
<br />
I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Key_Finding&diff=2622005:Symbolic Key Finding2005-02-01T20:44:56Z<p>138.37.33.58: /* Review 2 */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
There are significant contributions in the area of key finding for both audio and symbolic representations, so the same contest has also been proposed for audio data.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi): [high].<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi): [high].<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Craig Sapp (craig@ccrma.stanford.edu): [moderate].<br />
* David Temperley (dtemp@theory.esm.rochester.edu): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of each system should pre-determine the optimal settings for these parameters. Once the settings are determined, an accuracy rate may be calculated. The input for the test should be an excerpt of each piece in the test set, and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the key is stated in the title of the piece. The excerpt will typically be the beginning of the piece, as this is the only part for which the establishment of the known global key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that if an algorithm returns a key that is closely related to the actual key then it is superior. We may then use this information to generate further metrics.<br />
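<br />
A minimal sketch (in Python; the key encoding is an assumption, and the perfect-fifth relation is taken within the same mode) of labelling how 'close' an identified key is to the actual key under the relationships listed above:<br />
<pre>
def key_relation(predicted, actual):
    # Keys are (tonic pitch class 0-11 with C = 0, mode 'major' or 'minor').
    (pt, pm), (at, am) = predicted, actual
    if predicted == actual:
        return 'correct'
    if pm == am and (pt - at) % 12 in (5, 7):
        return 'perfect fifth'
    if pm != am and pt == at:
        return 'parallel major/minor'
    if am == 'major' and pm == 'minor' and (at - pt) % 12 == 3:
        return 'relative minor'
    if am == 'minor' and pm == 'major' and (pt - at) % 12 == 3:
        return 'relative major'
    return 'unrelated'

print(key_relation((9, 'minor'), (0, 'major')))   # relative minor (A minor vs. C major)
print(key_relation((7, 'major'), (0, 'major')))   # perfect fifth (G major vs. C major)
</pre>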
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
MIDI Collections: MIDI data are a symbolic representation of music, providing a numeric representation of the pitch, onset/offset time and velocity for every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files, by more than two thousand composers, in MIDI format. All the files are presented with full name and composer, and most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to the various arrangements of instruments.<br />
<br />
Score-based Collections: Score-based data are also symbolic representations of music. In addition to numeric event information, they also provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for audio data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared by having a test collection that includes audio data and its MIDI representation, or a MIDI representation and the audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI and audio.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
==Review 2==<br />
<br />
General comment: Title - Evaluation of Key Finding Algorithms Using Symbolic Data or Evaluation of Key Finding Algorithms Part 2<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
Regarding the evaluation procedures, the length of the input excerpt would have to be determined (15 to 30 seconds - are there any studies on the ideal length?)<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
The test data are relevant for the problem. For the classical archive web-site, permission would have to be obtained for large downloads at a go, else progressive downloads would have to be worked on (since only a certain number of performances is allowed for downloads in a day).</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Key_Finding&diff=2612005:Symbolic Key Finding2005-02-01T20:44:29Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
There are significant contributions in the area of key finding for both audio and symbolic representations, so the same contest has also been proposed for audio data.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi): [high].<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi): [high].<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Craig Sapp (craig@ccrma.stanford.edu): [moderate].<br />
* David Temperley (dtemp@theory.esm.rochester.edu): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of each system should pre-determine the optimal settings for these parameters. Once the settings are determined, an accuracy rate may be calculated. The input for the test should be an excerpt of each piece in the test set, and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the key is stated in the title of the piece. The excerpt will typically be the beginning of the piece, as this is the only part for which the establishment of the known global key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that if an algorithm returns a key that is closely related to the actual key then it is superior. We may then use this information to generate further metrics.<br />
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
MIDI Collections: MIDI data are a symbolic representation of music, providing a numeric representation of the pitch, onset/offset time and velocity for every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files, by more than two thousand composers, in MIDI format. All the files are presented with full name and composer, and most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to the various arrangements of instruments.<br />
<br />
Score-based Collections: Score-based data are also symbolic representations of music. In addition to numeric event information, they also provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for audio data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared by having a test collection that includes audio data and its MIDI representation, or a MIDI representation and the audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI and audio.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
==Review 2==</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Key_Finding&diff=2602005:Symbolic Key Finding2005-02-01T20:43:35Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
There are significant contributions in the area of key finding for both audio and symbolic representations, so the same contest has also been proposed for audio data.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi): [high].<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi): [high].<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Craig Sapp (craig@ccrma.stanford.edu): [moderate].<br />
* David Temperley (dtemp@theory.esm.rochester.edu): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of each system should pre-determine the optimal settings for these parameters. Once the settings are determined, an accuracy rate may be calculated. The input for the test should be an excerpt of each piece in the test set, and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the key is stated in the title of the piece. The excerpt will typically be the beginning of the piece, as this is the only part for which the establishment of the known global key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered 'close' if they have one of the following relationships: distance of a perfect fifth, relative major and minor, or parallel major and minor. An algorithm that returns a key closely related to the actual key can be considered superior to one that returns an unrelated key. We may then use this information to generate further metrics.<br />
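<br />
For illustration, a minimal Python sketch of such a closeness scoring rule is given below. The key-name syntax (e.g. "Eb minor" rather than "E flat minor") and the 0.5 credit given to closely related keys are assumptions made for this example only, not part of the proposal.<br />
<pre>
# Minimal sketch: scoring how 'close' an estimated key is to the ground-truth key.
# Assumed key-name syntax: tonic then mode, e.g. "C major", "Eb minor".
PITCH_CLASSES = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4,
                 'F': 5, 'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8, 'A': 9,
                 'A#': 10, 'Bb': 10, 'B': 11}

def parse_key(name):
    tonic, mode = name.split()
    return PITCH_CLASSES[tonic], mode.lower()

def key_score(estimated, truth):
    (pc_e, mode_e), (pc_t, mode_t) = parse_key(estimated), parse_key(truth)
    if (pc_e, mode_e) == (pc_t, mode_t):
        return 1.0                                    # exact match
    if mode_e == mode_t and (pc_e - pc_t) % 12 in (5, 7):
        return 0.5                                    # perfect fifth away
    if mode_e != mode_t and pc_e == pc_t:
        return 0.5                                    # parallel major/minor
    relative = (pc_t + 9) % 12 if mode_t == 'major' else (pc_t + 3) % 12
    if mode_e != mode_t and pc_e == relative:
        return 0.5                                    # relative major/minor
    return 0.0                                        # unrelated key

print(key_score('A minor', 'C major'))   # relative -> 0.5
print(key_score('G major', 'C major'))   # fifth    -> 0.5
</pre>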
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
MIDI Collections: MIDI data are a symbolic representation of music, providing a numeric representation of the pitch, onset/offset time and velocity of every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files by more than two thousand composers in MIDI format. All the files are presented with full title and composer, and most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to various arrangements of instruments. <br />
<br />
Score-based Collections: Score-based data are also symbolic representations of music. In addition to numeric event information, they also provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.<br />
<br />
==Review 1==<br />
<br />
==Review 2==</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Genre_Class&diff=5542005:Symbolic Genre Class2005-02-01T20:42:47Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Cory McKay (McGill University) cory.mckay@mail.mcgill.ca<br />
<br />
<br />
==Title==<br />
<br />
Genre Classification of MIDI Files<br />
<br />
<br />
==Description==<br />
<br />
Submitted software will automatically classify MIDI recordings into genre categories.<br />
<br />
1) Genre Categories<br />
The genre categories will be organized hierarchically, in order to enable evaluation of how well entries can perform both coarse and fine classifications. The particular categories to be used will be determined by the evaluation committee. Individual recordings could belong to more than one category, as this is more realistic than requiring that each recording be classified as belonging to exactly one category. A total of three to five coarse categories and ten to fifteen fine categories will be used. Model classifications will be made by the evaluation committee or a sub-committee of the evaluation committee. Entrants will be provided with the selection and organization of categories so that they can configure their software to reflect them before submission.<br />
<br />
2) Training and Testing Recordings<br />
Training and testing recordings will be chosen by the evaluation committee and kept confidential until after evaluations are complete. The test recordings will then be released, copyrights permitting.<br />
<br />
3) Input Data<br />
Training will be performed by providing the software (through a command-line argument) with a text file listing training MIDI file paths and model genre(s). Testing will be performed by providing the software (through a command-line argument) with a text file that contains a list of file paths of test MIDI recordings.<br />
<br />
4) Output Data<br />
The software will produce a text file listing test recording file paths and the genre(s) that each has been classified as.<br />
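<br />
As an illustration of the input/output handling described above, a minimal Python sketch follows. The tab-separated layout of the list files (file path followed by one or more genre labels) is an assumption for the example; the proposal does not fix the exact file format.<br />
<pre>
# Minimal sketch of the command-line wrapping described above. Assumed file layout:
# one recording per line, tab-separated: path, then zero or more genre labels.
import sys

def read_list(path):
    """Return a list of (midi_path, [genres]) pairs from a list file."""
    entries = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if fields and fields[0]:
                entries.append((fields[0], fields[1:]))
    return entries

def write_classifications(out_path, results):
    """results: list of (midi_path, [predicted genres]) pairs."""
    with open(out_path, 'w') as f:
        for midi_path, genres in results:
            f.write(midi_path + '\t' + '\t'.join(genres) + '\n')

if __name__ == '__main__':
    # e.g.  python classify.py train_list.txt test_list.txt output.txt
    train_list, test_list, out_file = sys.argv[1:4]
    training = read_list(train_list)                   # paths with model genres
    test_paths = [p for p, _ in read_list(test_list)]  # paths only
    # ... train a model on `training`, classify `test_paths` (omitted) ...
    write_classifications(out_file, [(p, ['unknown']) for p in test_paths])
</pre>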
<br />
<br />
==Potential Participants==<br />
<br />
* George Tzanetakis (University of Victoria), gtzan@cs.uvic.ca, high likelihood<br />
* Cory McKay & Ichiro Fujinaga (McGill University), cory.mckay@mail.mcgill.ca, high likelihood<br />
* Pedro J. Ponce de Leon & Jose M. Inesta (Universidad de Alicante), pierre@dlsi.ua.es, medium likelihood<br />
* Roberto Basili, Alfredo Serafini & Armando Stellato (University of Rome Tor Vergata), basili@info.uniroma2.it, medium likelihood<br />
* Man-Kwan Shan & Fang-Fei Kuo (National Cheng Chi University), mkshan@cs.nccu.edu.tw, medium likelihood<br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
Entries will be evaluated based on their success rates with respect to both fine and coarse classifications. Entrants will have the option of enabling their software to output classifications of "unknown," which will be penalized less severely during evaluation than misclassifications, as classifications flagged as uncertain are much better than false classifications in a practical context. Evaluation will be performed using 5-fold cross-validation.<br />
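<br />
To make the scoring concrete, a minimal Python sketch of a per-recording scoring rule and a 5-fold split is given below. The specific credit values (1.0 for a correct label, 0.25 for "unknown", 0.0 for a misclassification) are assumptions; the proposal only states that "unknown" should be penalized less severely than a misclassification.<br />
<pre>
# Minimal sketch: per-recording credit and 5-fold cross-validation indices.
# The 0.25 credit for "unknown" is an assumed value, not taken from the proposal.
import random

def score_prediction(predicted, truth_genres):
    if predicted == 'unknown':
        return 0.25                       # flagged as uncertain: partial penalty
    return 1.0 if predicted in truth_genres else 0.0

def five_fold_indices(n_items, seed=0):
    """Yield (train_indices, test_indices) pairs for 5-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
</pre>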
<br />
Submissions in C/C++, Java, Matlab and Python (and other languages?) will be accepted.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
* On-line repositories of MIDI files (sample links available at www.music.mcgill.ca/~cmckay/midi.html)<br />
* Research databases.<br />
<br />
==Review 1==<br />
<br />
The problem is very interesting for MIR, but too vaguely described. The role of the committee is not to propose anything, but to review the proposed evaluation sessions. Thus the author should propose a detailed list of genres and corresponding data.<br />
<br />
I'm not against organizing the genres hierarchically and associating several genres to each file, but this raises many issues that are not discussed at all here. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are coarse categories for classical and folk music, where lies the fine category of classical music adapted from folk songs? I suggest that the contest concentrates on the single genre problem.<br />
<br />
The choice of the genre classes is a crucial issue for the contest to be held several times. Indeed existing databases can be reused only when the defined categories are identical each year. Obviously the list of categories should reflect the list of MIDI music available on the internet. It would help if some data were already labeled according to this list.<br />
<br />
The list of relevant data should be developed. How many files are needed for learning and testing? Have the participants already collected some labeled data that they could give to the organizers? How much?<br />
<br />
Regarding the release of the data, I think that it would be better not to release anything. The training and test data should always be accessible through the D2K interface, and thus no copyright problem would appear. Is it possible to ensure that the test data are used only for testing and not for learning? Is it possible to implement learning easily in M2K? (each algorithm may use different structures to store learnt data)<br />
<br />
Finally, the evaluation procedure seems nice, but I don't have any clue whether the proposed participants are really interested.<br />
<br />
==Review 2==<br />
<br />
This is an interesting topic, one that I haven't seen much work on. I do not believe that it's difficult to get a large collection of MIDI files. Many are in the public domain, were never intended to be copyrighted, or have copyleft / creative commons licences. However, it's still difficult to assemble a reasonable collection of MIDI files of appropriate length which accurately represent a sufficient number of genres. This must be addressed.<br />
<br />
A key point is that it requires the Contest Committee to hand-label a large number of MIDI files. We also need to determine what our genres are. Is the Committee capable and willing to do this? <br />
I personally would find it very difficult to determine the genre of a MIDI recording which I don't recognize. MIDI all sounds like Muzak to me, unless I know the original audio recording. Has anyone tried MIDI-based genre classification before?<br />
<br />
I have no problems with the suggested evaluation and testing procedures.<br />
<br />
I think we need some more feedback on whether people are really interested in this. Most researchers who use MIDI, to my knowledge, aren't concerned with genre issues. George typically works with audio, so the proposer is the only one I'm aware of who I know is interested. I could be wrong, so let's ask around. We also need to explore the hand-labelling task, and to see if we can assemble a decent collection (which we should do regardless of this proposal). <br />
<br />
If there is significant interest, and the labeling can be done, then we should accept it.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Genre_Class&diff=5532005:Symbolic Genre Class2005-02-01T20:42:32Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Cory McKay (McGill University) cory.mckay@mail.mcgill.ca<br />
<br />
<br />
==Title==<br />
<br />
Genre Classification of MIDI Files<br />
<br />
<br />
==Description==<br />
<br />
Submitted software will automatically classify MIDI recordings into genre categories.<br />
<br />
1) Genre Categories<br />
The genre categories will be organized hierarchically, in order to enable evaluation of how well entries can perform both coarse and fine classifications. The particular categories to be used will be determined by the evaluation committee. Individual recordings could belong to more than one category, as this is more realistic than requiring that each recording be classified as belonging to exactly one category. A total of three to five coarse categories and ten to fifteen fine categories will be used. Model classifications will be made by the evaluation committee or a sub-committee of the evaluation committee. Entrants will be provided with the selection and organization of categories so that they can configure their software to reflect them before submission.<br />
<br />
2) Training and Testing Recordings<br />
Training and testing recordings will be chosen by the evaluation committee and kept confidential until after evaluations are complete. The test recordings will then be released, copyrights permitting.<br />
<br />
3) Input Data<br />
Training will be performed by providing the software (through a command-line argument) with a text file listing training MIDI file paths and model genre(s). Testing will be performed by providing the software (through a command-line argument) with a text file that contains a list of file paths of test MIDI recordings.<br />
<br />
4) Output Data<br />
The software will produce a text file listing test recording file paths and the genre(s) that each has been classified as.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* George Tzanetakis (University of Victoria), gtzan@cs.uvic.ca, high likelihood<br />
* Cory McKay & Ichiro Fujinaga (McGill University), cory.mckay@mail.mcgill.ca, high likelihood<br />
* Pedro J. Ponce de Leon & Jose M. Inesta (Universidad de Alicante), pierre@dlsi.ua.es, medium likelihood<br />
* Roberto Basili, Alfredo Serafini & Armando Stellato (University of Rome Tor Vergata), basili@info.uniroma2.it, medium likelihood<br />
* Man-Kwan Shan & Fang-Fei Kuo (National Cheng Chi University), mkshan@cs.nccu.edu.tw, medium likelihood<br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
Entries will be evaluated based on their success rates with respect to both fine and coarse classifications. Entrants will have the option of enabling their software to output classifications of "unknown," which will be penalized less severely during evaluation than misclassifications, as classifications flagged as uncertain are much better than false classifications in a practical context. Evaluation will be performed using 5-fold cross-validation.<br />
<br />
Submissions in C/C++, Java, Matlab and Python (and other languages?) will be accepted.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
* On-line repositories of MIDI files (sample links available at www.music.mcgill.ca/~cmckay/midi.html)<br />
* Research databases.<br />
<br />
==Review 1==<br />
<br />
<br />
==Review 2==<br />
<br />
This is an interesting topic, one that I haven't seen much work on. I do not believe that it's difficult to get a large collection of MIDI files. Many are in the public domain, were never intended to be copyrighted, or have copyleft / creative commons licences. However, it's still difficult to assemble a reasonable collection of MIDI files of appropriate length which accurately represent a sufficient number of genres. This must be addressed.<br />
<br />
A key point is that it requires the Contest Committee to hand-label a large number of MIDI files. We also need to determine what our genres are. Is the Committee capable and willing to do this? <br />
I personally would find it very difficult to determine the genre of a MIDI recording which I don't recognize. MIDI all sounds like Muzak to me, unless I know the original audio recording. Has anyone tried MIDI-based genre classification before?<br />
<br />
I have no problems with the suggested evaluation and testing procedures.<br />
<br />
I think we need some more feedback on whether people are really interested in this. Most researchers who use MIDI, to my knowledge, aren't concerned with genre issues. George typically works with audio, so the proposer is the only one I'm aware of who I know is interested. I could be wrong, so let's ask around. We also need to explore the hand-labelling task, and to see if we can assemble a decent collection (which we should do regardless of this proposal). <br />
<br />
If there is significant interest, and the labeling can be done, then we should accept it.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Symbolic_Key_Finding&diff=2592005:Symbolic Key Finding2005-02-01T20:42:03Z<p>138.37.33.58: /* Potential Participants */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
There are significant contributions in the area of key finding for both audio and symbolic representations. Thus the same contest has also been proposed for audio data.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Olli Yli-Harja (yliharja@cs.tut.fi), Ilya Schmulevich (is@ieee.org), and Kjell Lemström (kjell.lemstrom@cs.helsinki.fi): [high].<br />
* Tuomas Eerola (ptee@cc.jyu.fi) and Petri Toiviainen (ptoiviai@cc.jyu.fi): [high].<br />
* Arpi Mardirossian (mardiros@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Craig Sapp (craig@ccrma.stanford.edu): [moderate].<br />
* David Temperley (dtemp@theory.esm.rochester.edu): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for these parameters. Once the settings are determined, an accuracy rate may be calculated. The input of each test will be an excerpt of a piece in the test set, and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, as this is the only part of a piece for which the establishment of the known global key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered 'close' if they have one of the following relationships: distance of a perfect fifth, relative major and minor, or parallel major and minor. An algorithm that returns a key closely related to the actual key can be considered superior to one that returns an unrelated key. We may then use this information to generate further metrics.<br />
<br />
Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
MIDI Collections: MIDI data are a symbolic representation of music, providing a numeric representation of the pitch, onset/offset time and velocity of every event in a musical piece. The Classical Archives website (http://www.classicalarchives.com) provides more than thirty thousand full-length classical music files by more than two thousand composers in MIDI format. All the files are presented with full title and composer, and most of the files state the key clearly. Music by different composers may be used to test the range of the algorithm. Multiple versions of a piece may be used to test the algorithms' robustness to various arrangements of instruments. <br />
<br />
Score-based Collections: Score-based data are also symbolic representations of music. In addition to numeric event information, they also provide further pitch and time structure information such as contextually correct note names, and key and time signatures. MuseData (http://www.musedata.org), for example, provides access to such a score-based collection.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Tempo_Extraction&diff=7212005:Audio Tempo Extraction2005-02-01T20:41:09Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Martin F. McKinney (Philips) martin.mckinney@philips.com<br />
<br />
<br />
==Title==<br />
<br />
Automatic tempo extraction<br />
<br />
<br />
==Description==<br />
<br />
This contest will compare current methods for the extraction of tempo from musical audio. We distinguish between notated tempo and perceptual tempo and will test for the extraction of perceptual tempo. We will also test for tempo following if there is enough interest.<br />
<br />
We differentiate between notated tempo and perceived tempo. If you have the notated tempo (e.g., from the score) it is straightforward to attach a tempo annotation to an excerpt and run a contest for algorithms to predict the notated tempo. For excerpts for which we have no "official" tempo annotation, we can also annotate the *perceived* tempo. This is not a straightforward task and needs to be done carefully. If you ask a group of listeners (including skilled musicians) to annotate the tempo of music excerpts, they can give you different answers (they tap at different metrical levels) if they are unfamiliar with the piece. For some excerpts the perceived pulse or tempo is less ambiguous and everyone taps at the same metrical level, but for other excerpts the tempo can be quite ambiguous and you get a complete split across listeners.<br />
<br />
The annotation of perceptual tempo can take several forms: a probability density function as a function of tempo; a series of tempos, ranked by their respective perceptual salience; etc. These measures of perceptual tempo can be used as a ground truth on which to test algorithms for tempo extraction. The dominant perceived tempo is sometimes the same as the notated tempo but not always. A piece of music can "feel" faster or slower than its notated tempo in that the dominant perceived pulse can be a metrical level higher or lower than the notated tempo.<br />
<br />
There are several reasons to examine the perceptual tempo, either in place of or in addition to the notated tempo. For many applications of automatic tempo extractors, the perceived tempo of the music is more relevant than the notated tempo. An automatic playlist generator or music navigator, for instance, might allow listeners to select or filter music by its (automatically extracted) tempo. In this case, the "feel", or perceptual tempo may be more relevant than the notated tempo. An automatic DJ apparatus might also perform better with a representation of perceived tempo rather than notated tempo.<br />
<br />
A more pragmatic reason for using perceptual tempo rather than notated tempo as a ground truth for our contest is that we simply do not have the notated tempo of our test set. If we notate it by having a panel of expert listeners tap along and label the excerpts, we are by default dealing with the perceived tempo. The handling of this data as ground truth must be done with care.<br />
<br />
<br />
==Potential Participants==<br />
<br />
Last year's participants and organizers (Unconfirmed!):<br />
* Fabien Gouyon (fgouyon@iua.upf.es)<br />
* Miguel Alonso (miguel.alonso@enst.fr)<br />
* Simon Dixon (simon@oefai.at)<br />
* Christian Uhle (uhle@idmt.fraunhofer.de)<br />
* George Tzanetakis (gtzan@cs.uvic.ca)<br />
* Anssi Klapuri (klap@cs.tut.fi)<br />
<br />
==Evaluation Procedures==<br />
<br />
This section focuses on the mechanics of the method while we discuss the data (music excerpts and perceptual data) in the next section. There are two general steps to the method: 1) collection of perceptual tempo annotations; and 2) evaluation of tempo extraction algorithms.<br />
<br />
1) Perceptual tempo data collection<br />
The following procedure is described in more detail in McKinney and Moelants (2004) and Moelants and McKinney (2004). Listeners will be asked to tap to the beat of a series of musical excerpts. Responses will be collected and their perceived tempo will be calculated. For each excerpt, a distribution of perceived tempo will be generated. A relatively simple form of perceived tempo is proposed for this contest: The two highest peaks in the perceived tempo distribution for each excerpt will be taken, along with their respective heights (normalized to sum to 1.0) as the two tempo candidates for that particular excerpt. The height of a peak in the distribution is assumed to represent the perceptual salience of that tempo. In addition to tempo, the phase and tapping times of listeners will also be recorded for possible evaluation of phase-locking and tempo following of tempo-extraction algorithms.<br />
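<br />
A minimal Python sketch of how the two tempo candidates could be derived from the collected taps is given below. The 2-BPM histogram bins and the use of the median inter-tap interval are assumptions made for illustration; the actual procedure is described in the references.<br />
<pre>
# Minimal sketch: from each listener's tap times to two tempo candidates with
# normalized saliences. Bin width and the median-based tempo estimate are assumptions.
from collections import Counter

def tapped_tempo(tap_times):
    """Convert one listener's tap times (in seconds) to a tempo in BPM."""
    intervals = sorted(b - a for a, b in zip(tap_times, tap_times[1:]))
    median_interval = intervals[len(intervals) // 2]
    return 60.0 / median_interval

def tempo_candidates(all_tap_times, bin_width=2.0):
    """Return [(tempo_bpm, salience), (tempo_bpm, salience)], saliences summing to 1."""
    histogram = Counter()
    for taps in all_tap_times:                    # one list of tap times per listener
        bpm = tapped_tempo(taps)
        histogram[round(bpm / bin_width) * bin_width] += 1
    (t1, c1), (t2, c2) = histogram.most_common(2)
    total = float(c1 + c2)
    return [(t1, c1 / total), (t2, c2 / total)]
</pre>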
<br />
References:<br />
* McKinney, M.F. and Moelants, D. (2004), Deviations from the resonance theory of tempo induction, Conference on Interdisciplinary Musicology, Graz. URL: http://gewi.kfunigraz.ac.at/~cim04/CIM04_paper_pdf/McKinney_Moelants_CIM04_proceedings_t.pdf<br />
* Moelants, D. and McKinney, M.F. (2004), Tempo perception and musical content: What makes a piece slow, fast, or temporally ambiguous? International Conference on Music Perception & Cognition, Evanston, IL. URL: http://www.northwestern.edu/icmpc/proceedings/ICMPC8/PDF/AUTHOR/MP040237.PDF<br />
<br />
2) Evaluation of tempo extraction algorithms<br />
Algorithms will process musical excerpts and be rated on the following tasks (a scoring sketch is given after this list):<br />
* Ability to identify the most salient (primary) tempo (to within 3%)<br />
* Ability to identify the 2nd most salient (secondary) tempo (to within 3%)<br />
* Ability to identify an integer multiple of the primary tempo (to within 3%) (this task is a given if task 1 is performed correctly)<br />
* Ability to identify an integer multiple of secondary tempo (to within 3%) (this task is a given if task 2 is performed correctly)<br />
* (optional) Ability to correctly identify phase of tempo<br />
* (optional) Ability to follow tempo on excerpts with varying tempo<br />
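<br />
A minimal Python sketch of the first four checks is given below. The 3% relative tolerance follows the list above; the range of integer multiples tried (1 to 4) is an assumption for the example.<br />
<pre>
# Minimal sketch: checking estimated tempi against the two annotated tempi.
def within(estimate, target, tol=0.03):
    return abs(estimate - target) <= tol * target

def evaluate_tempi(est_primary, est_secondary, truth):
    primary, secondary = truth        # ground-truth tempi in BPM, most salient first
    return {
        'primary_correct':    within(est_primary, primary),
        'secondary_correct':  within(est_secondary, secondary),
        'primary_multiple':   any(within(est_primary, m * primary) for m in (1, 2, 3, 4)),
        'secondary_multiple': any(within(est_secondary, m * secondary) for m in (1, 2, 3, 4)),
    }
</pre>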
<br />
==Relevant Test Collections==<br />
<br />
From previous studies on tempo perception (see references) we have 3 sets of annotated musical excerpts:<br />
* 24 10-second excerpts annotated by 33 subjects, excerpts were taken primarily from Western popular music.<br />
* 60 30-second excerpts annotated by 24 subjects, excerpts were taken from Western popular, classical and world ethnic music.<br />
* 50 30-second excerpts annotated by 40 subjects, excerpts were taken from a broad range of musical styles.<br />
<br />
The 10-second excerpts from our first set may be too short. I think we might want to stick with longer excerpts (15 seconds or longer). In addition, we could conduct further listening/tapping sessions in order to supplement the current set of annotations. Source material could come from different sources:<br />
* Miguel Alonso (ENST) has a database with several hundred excerpts<br />
* (?) Fabien Gouyon (?) Last years database (?)<br />
* Other labs working on tempo extraction<br />
* Public music databases<br />
<br />
We will also provide some measures of statistical significance to the results, most likely through bootstrapping the test data.<br />
<br />
Concerning copyright issues: I'm not sure if there will be any issues here if all music is simply collected in one place and then the contest algorithms are run there. In addition, I've heard that it is legal to use/distribute short excerpts of recorded audio without violating copyright. Can anyone confirm/deny or provide more info on copyright issues for short excerpts?<br />
<br />
==Review 1==<br />
<br />
I think that your proposal is clearly written and definitely appropriate for ISMIR. I agree with your justification for the analysis of perceptual tempo, especially for applications related to human interaction. However, in order to build upon last year's contest I would support the inclusion of 'phase locking' and 'tempo following' as areas to investigate under this proposal, in addition to some further consideration of the evaluation procedures.<br />
<br />
I think your list of participants is realistic - equally I believe many of these people have published work on beat tracking as well as tempo analysis, which suggests there should be support for an expanded proposal.<br />
<br />
In terms of the data to be analyzed, I agree that longer excerpts are necessary, especially if the proposal is to be expanded to incorporate tempo following and phase information. I wonder if it might be interesting to classify input signals not only by genre (as you suggest), but also by the presence or absence of percussion. This would be another way to demonstrate the generality of the entered algorithms, but might also provide some further insight into those signals for which the perceptual tempo is most open to subjective interpretation - put simply: is there more agreement (computationally and in annotations) when drums are present? I would also like to see some consideration given to examples that aren't in 4/4 time, as well those which are heavily syncopated (if not already present in the proposed databases). <br />
<br />
I have a couple of concerns regarding the evaluation criteria, particularly related to the second most salient level:<br />
* Should it be mandatory for participants to look for more than one appropriate tempo for a given input signal?<br />
* I can see from your description of the data collection that extracting two levels is not too hard, however is it generally intuitive whether this secondary level is faster or slower than the primary level? I wonder if it might be more valuable to find something more explicit, like the tatum (fastest metrical level), or time-signature. I'm not sure if these would count as perceptual tempi if no one actually chooses to tap that quickly or slowly.<br />
* In cases (if any) where there is complete agreement on the perceptual level, how would the second most salient level be defined? <br />
<br />
I think you're right to suggest a tempo-dependent threshold, but I'm interested as to where this value of 3% comes from. Might it be a little too strict? Was this the value suggested for last year's contest?<br />
<br />
Given that your annotated data for perceptual tempo is derived from subjects 'tapping' along to music, it seems worthwhile expanding the scope of this proposal to include phase information and tempo following. Perhaps optionally making this a tempo and beat tracking contest. Again I'm aware of the potential problems in deriving a globally acceptable strategy for the evaluation of beat locations (current examples include: Goto-97, the longest continuous correctly tracked segment, or Scheirer-98, RMS deviation between algorithm and annotated beats), but I think this is a factor which should be addressed.<br />
<br />
==Review 2==<br />
<br />
The problem is a relevant MIR task which is clearly defined. The proposed participants seem likely to participate indeed.<br />
<br />
I do not have much to say, since this proposal is already very solid.<br />
<br />
I appreciate the fact that the potential participants already own a large amount of annotated data, so that the work to annotate new data will be limited. However it seems that a large number of listeners is needed for annotation, because several perceptual tempos are taken into account for evaluation. Would it be possible to propose evaluation measures that are relevant whatever the number of annotators (probably fewer than five annotators will be available for new annotations)? Or to evaluate performance differently on each file depending on the number of annotators?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Tempo_Extraction&diff=7202005:Audio Tempo Extraction2005-02-01T20:40:30Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Martin F. McKinney (Philips) martin.mckinney@philips.com<br />
<br />
<br />
==Title==<br />
<br />
Automatic tempo extraction<br />
<br />
<br />
==Description==<br />
<br />
This contest will compare current methods for the extraction of tempo from musical audio. We distinguish between notated tempo and perceptual tempo and will test for the extraction of perceptual tempo. We will also test for tempo following if there is enough interest.<br />
<br />
We differentiate between notated tempo and perceived tempo. If you have the notated tempo (e.g., from the score) it is straightforward to attach a tempo annotation to an excerpt and run a contest for algorithms to predict the notated tempo. For excerpts for which we have no "official" tempo annotation, we can also annotate the *perceived* tempo. This is not a straightforward task and needs to be done carefully. If you ask a group of listeners (including skilled musicians) to annotate the tempo of music excerpts, they can give you different answers (they tap at different metrical levels) if they are unfamiliar with the piece. For some excerpts the perceived pulse or tempo is less ambiguous and everyone taps at the same metrical level, but for other excerpts the tempo can be quite ambiguous and you get a complete split across listeners.<br />
<br />
The annotation of perceptual tempo can take several forms: a probability density function as a function of tempo; a series of tempos, ranked by their respective perceptual salience; etc. These measures of perceptual tempo can be used as a ground truth on which to test algorithms for tempo extraction. The dominant perceived tempo is sometimes the same as the notated tempo but not always. A piece of music can "feel" faster or slower than its notated tempo in that the dominant perceived pulse can be a metrical level higher or lower than the notated tempo.<br />
<br />
There are several reasons to examine the perceptual tempo, either in place of or in addition to the notated tempo. For many applications of automatic tempo extractors, the perceived tempo of the music is more relevant than the notated tempo. An automatic playlist generator or music navigator, for instance, might allow listeners to select or filter music by its (automatically extracted) tempo. In this case, the "feel", or perceptual tempo may be more relevant than the notated tempo. An automatic DJ apparatus might also perform better with a representation of perceived tempo rather than notated tempo.<br />
<br />
A more pragmatic reason for using perceptual tempo rather than notated tempo as a ground truth for our contest is that we simply do not have the notated tempo of our test set. If we notate it by having a panel of expert listeners tap along and label the excerpts, we are by default dealing with the perceived tempo. The handling of this data as ground truth must be done with care.<br />
<br />
<br />
==Potential Participants==<br />
<br />
Last year's participants and organizers (Unconfirmed!):<br />
* Fabien Gouyon (fgouyon@iua.upf.es)<br />
* Miguel Alonso (miguel.alonso@enst.fr)<br />
* Simon Dixon (simon@oefai.at)<br />
* Christian Uhle (uhle@idmt.fraunhofer.de)<br />
* George Tzanetakis (gtzan@cs.uvic.ca)<br />
* Anssi Klapuri (klap@cs.tut.fi)<br />
<br />
==Evaluation Procedures==<br />
<br />
This section focuses on the mechanics of the method while we discuss the data (music excerpts and perceptual data) in the next section. There are two general steps to the method: 1) collection of perceptual tempo annotations; and 2) evaluation of tempo extraction algorithms.<br />
<br />
1) Perceptual tempo data collection<br />
The following procedure is described in more detail in McKinney and Moelants (2004) and Moelants and McKinney (2004). Listeners will be asked to tap to the beat of a series of musical excerpts. Responses will be collected and their perceived tempo will be calculated. For each excerpt, a distribution of perceived tempo will be generated. A relatively simple form of perceived tempo is proposed for this contest: The two highest peaks in the perceived tempo distribution for each excerpt will be taken, along with their respective heights (normalized to sum to 1.0) as the two tempo candidates for that particular excerpt. The height of a peak in the distribution is assumed to represent the perceptual salience of that tempo. In addition to tempo, the phase and tapping times of listeners will also be recorded for possible evaluation of phase-locking and tempo following of tempo-extraction algorithms.<br />
<br />
References:<br />
* McKinney, M.F. and Moelants, D. (2004), Deviations from the resonance theory of tempo induction, Conference on Interdisciplinary Musicology, Graz. URL: http://gewi.kfunigraz.ac.at/~cim04/CIM04_paper_pdf/McKinney_Moelants_CIM04_proceedings_t.pdf<br />
* Moelants, D. and McKinney, M.F. (2004), Tempo perception and musical content: What makes a piece slow, fast, or temporally ambiguous? International Conference on Music Perception & Cognition, Evanston, IL. URL: http://www.northwestern.edu/icmpc/proceedings/ICMPC8/PDF/AUTHOR/MP040237.PDF<br />
<br />
2) Evaluation of tempo extraction algorithms<br />
Algorithms will process musical excerpts and be rated on the following tasks:<br />
* Ability to identify the most salient (primary) tempo (to within 3%)<br />
* Ability to identify the 2nd most salient (secondary) tempo (to within 3%)<br />
* Ability to identify an integer multiple of the primary tempo (to within 3%) (this task is a given if task 1 is performed correctly)<br />
* Ability to identify an integer multiple of secondary tempo (to within 3%) (this task is a given if task 2 is performed correctly)<br />
* (optional) Ability to correctly identify phase of tempo<br />
* (optional) Ability to follow tempo on excerpts with varying tempo<br />
<br />
==Relevant Test Collections==<br />
<br />
From previous studies on tempo perception (see references) we have 3 sets of annotated musical excerpts:<br />
* 24 10-second excerpts annotated by 33 subjects, excerpts were taken primarily from Western popular music.<br />
* 60 30-second excerpts annotated by 24 subjects, excerpts were taken from Western popular, classical and world ethnic music.<br />
* 50 30-second excerpts annotated by 40 subjects, excerpts were taken from a broad range of musical styles.<br />
<br />
The 10-second excerpts from our first set may be too short. I think we might want to stick with longer excerpts (15 seconds or longer). In addition, we could conduct further listening/tapping sessions in order to supplement the current set of annotations. Source material could come from different sources:<br />
* Miguel Alonso (ENST) has a database with several hundred excerpts<br />
* (?) Fabien Gouyon (?) Last years database (?)<br />
* Other labs working on tempo extraction<br />
* Public music databases<br />
<br />
We will also provide some measures of statistical significance to the results, most likely through bootstrapping the test data.<br />
<br />
Concerning copyright issues: I'm not sure if there will be any issues here if all music is simply collected in one place and then the contest algorithms are run there. In addition, I've heard that it is legal to use/distribute short excerpts of recorded audio without violating copyright. Can anyone confirm/deny or provide more info on copyright issues for short excerpts?<br />
<br />
==Review 1==<br />
<br />
==Review 2==<br />
<br />
The problem is a relevant MIR task which is clearly defined. The proposed participants seem likely to participate indeed.<br />
<br />
I do not have much to say, since this proposal is already very solid.<br />
<br />
I appreciate the fact that the potential participants already own a large amount of annotated data, so that the work to annotate new data will be limited. However it seems that a large number of listeners is needed for annotation, because several perceptual tempos are taken into account for evaluation. Would it be possible to propose evaluation measures that are relevant whatever the number of annotators (probably fewer than five annotators will be available for new annotations)? Or to evaluate performance differently on each file depending on the number of annotators?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3762005:Audio Onset Detect2005-02-01T20:39:53Z<p>138.37.33.58: /* Review 1 */</p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) Input data<br />
Audio format:<br />
The data will be monophonic sound files, with the associated onset times and data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* the file length is not critical for this task, but excerpts of 30 seconds at most would be convenient if we want to have a reasonable diversity in the dataset. It must be remembered that real-world sounds must be manually annotated (a painful and time-consuming task, as pointed out by J. Bello at MIREX 2004).<br />
<br />
Audio content:<br />
The dataset will be subdivided into classes. This idea has been evoked by D. Ellis at last MIREX. The reasons why:<br />
* onset detection is performed in various applications, some of which are dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...)<br />
* the composition of the entire database can determine the relative ranking of the onset detection algorithms. For example, an evaluation on a dataset principally composed of complex mixes will not favour an onset detector that performs well on solo phrases of bowed strings but slightly worse than the others on complex mixes.<br />
* it can show the weak points of the compared methods. I think it is more useful than an evaluation based on an overall success percentage or curve. <br />
Suggestions for such classes:<br />
Two alternative subdivision schemes can be defined:<br />
* monophonic instruments solo phrases<br />
* polyphonic instruments solo phrases<br />
* complex mixes<br />
Or, as suggested by Bello et al.:<br />
* pitched percussive instruments phrases<br />
* pitched non-percussive instruments phrases<br />
* non-pitched percussive instruments phrases<br />
* complex mixes<br />
<br />
Meta data:<br />
Two types of annotation can be provided:<br />
* Manual annotation for the real-world sounds. The article referenced below discusses the potential difficulties of this type of annotation.<br />
* MIDI scores for synthesized sounds or MIDI-controlled instruments. These are considered robust ground truth.<br />
<br />
Notes on annotation:<br />
As mentioned above, the sound files will be provided with their onset time annotation. The ground-truth we will define can be critical for the evaluation.<br />
For the MIDI-controlled instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.<br />
For real world sounds, annotation volunteers are needed. The annotations should be cross-validated (errare humanum est). Precise instructions on which events to annotate must be given to the annotators.<br />
Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). This also means that the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are quite impossible to annotate precisely: legato bowed string phrases, even more so if you add reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread out, and the annotation precision must be taken into account in the evaluation. How the annotation is taken into account must be precisely defined... my opinion is to discard sound events that are not musical notes, for example breathing, key strokes, etc., which are quite frequent in solo recordings, even if they are detected by most onset detection algorithms...<br />
<br />
Article and matlab tool for annotation by Pierre Leveau et al.<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm<br />
<br />
2) Output data<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
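<br />
A minimal Python sketch of writing such an output file is given below. One onset time in seconds per line is an assumption for the example; the proposal only fixes the file naming convention.<br />
<pre>
# Minimal sketch: write detected onset times to the required _onsets.txt file.
import os

def write_onsets(results_dir, audio_filename, onset_times):
    base = os.path.splitext(os.path.basename(audio_filename))[0]
    out_path = os.path.join(results_dir, base + '_onsets.txt')
    with open(out_path, 'w') as f:
        for t in onset_times:
            f.write('%.4f\n' % t)   # assumed format: one time in seconds per line
    return out_path
</pre>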
<br />
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr><br />
* Universitat Pompeu Fabra, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. A detected onset time that falls within a tolerance time-window around a ground-truth onset is considered a correct detection; otherwise it is a false positive. (A sketch of this matching procedure is given after the list of measures below.)<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance from 50 ms downwards). For certain files, we can't be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
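<br />
A minimal Python sketch of this matching and scoring procedure is given below. The greedy one-to-one matching within a single +/-50 ms window is an assumed implementation detail; the proposal only fixes the idea of a tolerance window. One-to-one matching also means doubled detections count as false positives.<br />
<pre>
# Minimal sketch: match detected onsets to ground-truth onsets within a tolerance
# window (50 ms assumed) and compute precision/recall. One-to-one matching means a
# doubled detection of the same onset is counted as a false positive.
def evaluate_onsets(detected, ground_truth, tolerance=0.05):
    detected, ground_truth = sorted(detected), sorted(ground_truth)
    matched = set()
    correct = 0
    for t in detected:
        # closest still-unmatched ground-truth onset within the tolerance window
        candidates = [(abs(t - g), i) for i, g in enumerate(ground_truth)
                      if i not in matched and abs(t - g) <= tolerance]
        if candidates:
            matched.add(min(candidates)[1])
            correct += 1
    false_positives = len(detected) - correct
    precision = correct / float(len(detected)) if detected else 0.0
    recall = correct / float(len(ground_truth)) if ground_truth else 0.0
    return precision, recall, false_positives
</pre>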
<br />
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of RWC Database, recordings in the labs (MIDI generated or human), upcoming FreeSound database, etc...<br />
Some of them have already been cross-annotated. It would be helpful if everyone owning an already annotated sound onset database detailed its contents (source of the annotation: MIDI, how many human subjects, etc.). This would give an overview of the number of onsets we already have, and of where they come from...<br />
<br />
==Review 1==<br />
<br />
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting if the proposal briefly discussed whether the evaluation metrics are unbiased with respect to the different potential applications.<br />
<br />
In order to decide which algorithm is the winner, a single number should finally be extracted. A possibility for doing so is tuning the algorithms to a single working point on the ROC curve, e.g. allowing a difference between FP and FN of less than 1%.<br />
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.<br />
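<br />
For illustration, a minimal Python sketch of McNemar's test applied to two algorithms evaluated on the same set of events might look as follows; the variable names and the chi-squared approximation with continuity correction are assumptions for the example.<br />
<pre>
# Minimal sketch: McNemar's test statistic for comparing two detectors/classifiers
# on the same events. Compare the result against the chi-squared distribution with
# one degree of freedom (e.g. a value above 3.84 corresponds to p below 0.05).
def mcnemar_chi2(a_correct, b_correct):
    """a_correct, b_correct: parallel lists of booleans, one entry per test event."""
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    if only_a + only_b == 0:
        return 0.0
    return (abs(only_a - only_b) - 1) ** 2 / float(only_a + only_b)
</pre>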
<br />
It does not mention whether there will be training data available to participants.<br />
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.<br />
<br />
The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose; I suggest that the author reduce the length of the proposal and make it more assertive.<br />
There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database could already be of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I will also suggest keeping the use of MIDI sounds to the minimum possible.<br />
Separating results by type of onset (e.g. percussive, pop, etc) seems a logical choice, so I agree with the author on that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context on which they appear: onsets on pitched percussive music (e.g. piano and guitar music), onsets on pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets on non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Onset_Detect&diff=3752005:Audio Onset Detect2005-02-01T20:39:34Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk<br />
<br />
<br />
==Title==<br />
<br />
Onset Detection Contest<br />
<br />
<br />
==Description==<br />
<br />
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.<br />
<br />
1) Input data<br />
Audio format:<br />
The data will be monophonic sound files, with the associated onset times and data about the annotation robustness.<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* the file length is not critical for this task, but excerpts of 30 seconds at most would be convenient if we want to have a reasonable diversity in the dataset. It must be remembered that real-world sounds must be manually annotated (a painful and time-consuming task, as pointed out by J. Bello at MIREX 2004).<br />
<br />
Audio content:<br />
The dataset will be subdivided into classes. This idea has been evoked by D. Ellis at last MIREX. The reasons why:<br />
* onset detection is performed in various applications, some of which are dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...)<br />
* the composition of the entire database can determine the relative ranking of the onset detection algorithms. For example, an evaluation on a dataset principally composed of complex mixes will not favour an onset detector that performs well on solo phrases of bowed strings but slightly worse than the others on complex mixes.<br />
* it can show the weak points of the compared methods. I think it is more useful than an evaluation based on an overall success percentage or curve. <br />
Suggestions for such classes:<br />
Two alternative subdivision schemes can be defined:<br />
* monophonic instruments solo phrases<br />
* polyphonic instruments solo phrases<br />
* complex mixes<br />
Or, as suggested by Bello et al.:<br />
* pitched percussive instruments phrases<br />
* pitched non-percussive instruments phrases<br />
* non-pitched percussive instruments phrases<br />
* complex mixes<br />
<br />
Meta data:<br />
Two types of annotation can be provided:<br />
* Manual annotation for the real-world sounds. The article referenced below discusses the potential difficulties of this type of annotation.<br />
* MIDI scores for synthesized sounds or MIDI-controlled instruments. These are considered robust ground truth.<br />
<br />
Notes on annotation:<br />
As mentioned above, the sound files will be provided with their onset time annotation. The ground-truth we will define can be critical for the evaluation.<br />
For the MIDI-controlled instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.<br />
For real world sounds, annotation volunteers are needed. The annotations should be cross-validated (errare humanum est). Precise instructions on which events to annotate must be given to the annotators.<br />
Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). This also means that the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are quite impossible to annotate precisely: legato bowed string phrases, even more so if you add reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread out, and the annotation precision must be taken into account in the evaluation. How the annotation is taken into account must be precisely defined... my opinion is to discard sound events that are not musical notes, for example breathing, key strokes, etc., which are quite frequent in solo recordings, even if they are detected by most onset detection algorithms...<br />
<br />
Article and matlab tool for annotation by Pierre Leveau et al.<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf<br />
<br />
http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm<br />
<br />
2) Output data<br />
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Tampere University of Technology, Audio Research Group<br />
Anssi Klapuri <klap@cs.tut.fi><br />
* MIT, MediaLab<br />
Tristan Jehan <tristan@medialab.mit.edu><br />
* LAM, France<br />
Pierre Leveau <leveau@lam.jussieu.fr><br />
Laurent Daudet <daudet@lam.jussieu.fr><br />
* IRCAM, France<br />
Xavier Rodet <rod@ircam.fr><br />
* Universitat Pompeu Fabra, Music Technology Group<br />
Julien Ricard <jricard@iua.upf.es><br />
Fabien Gouyon <fgouyon@iua.upf.es><br />
* Queen Mary College, Centre for Digital Music<br />
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk><br />
Paul Brossier <paul.brossier@qmul.elec.ac.uk><br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
The detected onset times will be compared with the ground-truth ones. A detected onset time that falls within a tolerance time-window around a ground-truth onset is considered a correct detection; otherwise it is a false positive.<br />
Evaluation measures:<br />
* percentage of correct detections / false positives (can also be expressed as precision/recall)<br />
* time precision (tolerance from 50 ms downwards). For certain files, we can't be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.<br />
* separate scoring for different instrument types (percussive, strings, winds) <br />
More detailed data:<br />
* percentage of doubled detections<br />
* speed measurements of the algorithms<br />
* scalability to large files<br />
* robustness to noise, loudness<br />
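To make the matching rule concrete, here is a minimal sketch of scoring one file with greedy one-to-one matching inside a fixed tolerance window; it is an illustration only, and the function and parameter names are not part of the proposal.<br />
<pre>
# Illustrative sketch: precision/recall for one file, given ground-truth and
# detected onset times in seconds. Greedy one-to-one matching within a
# tolerance window; unmatched detections count as false positives (which also
# covers doubled detections).
def score_onsets(reference, detected, tolerance=0.05):
    reference = sorted(reference)
    detected = sorted(detected)
    matched = set()
    correct = 0
    for t in detected:
        best, best_dist = None, None
        for i, r in enumerate(reference):
            if i in matched:
                continue
            d = abs(t - r)
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= tolerance:
            matched.add(best)
            correct += 1
    false_positives = len(detected) - correct
    precision = correct / float(len(detected)) if detected else 0.0
    recall = correct / float(len(reference)) if reference else 0.0
    return precision, recall, false_positives

# e.g. score_onsets([0.50, 1.00, 1.52], [0.51, 1.06, 1.50, 2.00], tolerance=0.05)
</pre>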
<br />
<br />
==Relevant Test Collections==<br />
<br />
Possible sources: excerpts of the RWC Database, recordings made in the labs (MIDI-generated or human), the upcoming FreeSound database, etc.<br />
Some of them have already been cross-annotated. It would help if each group owning an already-annotated onset database detailed its contents (source of the annotation: MIDI, how many human subjects, etc.). This would give an overview of how many annotated onsets we already have and where they come from.<br />
<br />
==Review 1==<br />
<br />
<br />
==Review 2==<br />
<br />
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR community. I think the authors need to justify in clear terms what the MIR community would gain by carrying out such an evaluation exercise.<br />
<br />
The problem is well defined; however, the author needs to take care when defining the task of onset detection for non-percussive events (e.g. a bowed onset from a cello) or for non-musical events (e.g. breathing or key strokes that produce transient noise in the signal). The evaluation needs to consider these cases.<br />
<br />
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.<br />
<br />
The evaluation procedures are not clear to me. The current proposal is quite verbose; I suggest that the author reduce its length and make it more assertive.<br />
There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.<br />
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.<br />
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.<br />
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present high variance in their annotations. These observations on the annotated database could already be of great interest to the community.<br />
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.<br />
<br />
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic, and for the results to be of interest to the community the test data should contain real-life cases. I would also suggest keeping the use of MIDI sounds to a minimum.<br />
Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context in which they appear: onsets in pitched percussive music (e.g. piano and guitar music), onsets in pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets in non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Key_Finding&diff=3472005:Audio Key Finding2005-02-01T20:38:44Z<p>138.37.33.58: /* Potential Participants */</p>
<hr />
<div>==Proposer==<br />
<br />
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu<br />
<br />
<br />
==Title==<br />
<br />
Evaluation of Key Finding Algorithms<br />
<br />
<br />
==Description==<br />
<br />
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.<br />
<br />
There are significant contributions in the area of key finding for both audio and symbolic representations; a parallel contest has therefore also been proposed for MIDI data. Algorithms that determine the key from audio should be robust enough to handle frequency interference and harmonic effects caused by the use of multiple instruments.<br />
<br />
<br />
==Potential Participants==<br />
<br />
* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es): [high].<br />
* Steffen Pauws (steffen.pauws@philips.com): [high].<br />
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu): [high].<br />
* Ozgur Izmirli (oizm@conncoll.edu): [moderate].<br />
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg): [unknown].<br />
<br />
==Evaluation Procedures==<br />
<br />
The following evaluation outline is a general guideline that is compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of each system should pre-determine the optimal settings for these parameters. Once the settings are determined, an accuracy rate may be calculated. The input for the test will be an excerpt of each piece in the test set, and the output will be the key name, for example C major or E-flat minor. We plan to use pieces whose keys are known, for example symphonies and concertos by well-known composers where the key is stated in the title of the piece. The excerpt will typically be the beginning of the piece, as this is the only part for which the establishment of the global, known key can be guaranteed.<br />
<br />
The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered 'close' if they have one of the following relationships: a distance of a perfect fifth, relative major and minor, or parallel major and minor. It can be assumed that an algorithm returning a key closely related to the actual key is better than one returning an unrelated key. We may then use this information to generate further metrics. (A sketch of this closeness check is given below.)<br />
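As an illustration only (the proposal does not fix an exact scoring scheme), the sketch below classifies the relationship between a detected key and the ground-truth key; the key-name parsing and any weights one might attach to each relationship are assumptions.<br />
<pre>
# Illustrative sketch: classify the relationship between a detected key and
# the ground-truth key. Key-name parsing ("tonic mode") is an assumption.
PITCH_CLASS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
               "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9,
               "A#": 10, "Bb": 10, "B": 11}

def parse_key(name):
    tonic, mode = name.split()          # e.g. "Eb minor" -> (3, "minor")
    return PITCH_CLASS[tonic], mode.lower()

def key_relationship(detected, reference):
    dt, dm = parse_key(detected)
    rt, rm = parse_key(reference)
    if (dt, dm) == (rt, rm):
        return "exact"
    if dm == rm and (dt - rt) % 12 in (5, 7):
        return "perfect fifth"          # dominant or subdominant
    if dm != rm:
        major, minor = (dt, rt) if dm == "major" else (rt, dt)
        if (major + 9) % 12 == minor:
            return "relative"           # e.g. A minor vs C major
        if dt == rt:
            return "parallel"           # e.g. C minor vs C major
    return "unrelated"

# e.g. key_relationship("G major", "C major") -> "perfect fifth"
</pre>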
<br />
Clearly, the optimal parameters may vary for different styles of music and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal parameter settings and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percentage of pieces that had an incorrect assignment under the optimal settings but a correct assignment with other settings.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that only fifteen to thirty second excerpts may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of audio files may be copied without any violations of such regulations. This is advantageous since fifteen to thirty second excerpts will be well within this limit.<br />
<br />
<br />
==Review 1==<br />
<br />
The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for audio data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them by having a test collection including audio data and its MIDI representation, or a MIDI representation and the audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI and audio.<br />
<br />
Regarding the key estimation contest on audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles, for instance popular music whose key is known?<br />
<br />
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". What about tuning errors? In the case of audio, different tuning systems can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442, ...). Keys should also be considered 'close' if they differ by one semitone, to account for the difference between the real key (according to its tuning) and the labelled key (e.g. A major). In the case of MIDI, this problem does not appear.<br />
<br />
Will there be some training data, so that participants can try their algorithms?<br />
<br />
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.<br />
<br />
==Review 2==<br />
<br />
General comments:<br />
Title: "Evaluation of Key Finding Algorithms Using Audio Data" or "Evaluation of Key Finding Algorithms, Part 1"<br />
Description Paragraph: Par 2, Line 2 - sentence requires correction<br />
<br />
The problem is well defined and the mentioned possible participants seem likely to participate.<br />
<br />
Regarding the evaluation procedures, the length of the input excerpt would have to be determined (15 to 30 seconds; are there any studies on the ideal length?)<br />
Assumption of closeness:<br />
* Perfect 5th: Is this generally accepted as an almost similar key?<br />
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field) <br />
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?<br />
<br />
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Melody_Extr&diff=1352005:Audio Melody Extr2005-02-01T20:37:51Z<p>138.37.33.58: </p>
<hr />
<div>==Proposer==<br />
<br />
Graham Poliner (Columbia University) graham@ee.columbia.edu<br />
<br />
<br />
==Title==<br />
<br />
Melody Extraction of Polyphonic Audio<br />
<br />
<br />
==Description==<br />
<br />
The melodic content of polyphonic audio provides an intuitive representation for summarization and retrieval. Numerous potential approaches exist for automated melody extraction; therefore, the MIREX 2005 Melody Extraction Evaluation seeks to compare the accuracy of state-of-the-art melody transcription algorithms. The evaluation data set will consist of an eclectic collection of audio excerpts along with the corresponding frame-based transcription of the dominant voice. The performance of the submitted algorithms will be evaluated based on the percentage of frames correctly transcribed. <br />
<br />
<br />
==Potential Participants==<br />
<br />
*Juan P. Bello - juan.bello-correa@elec.qmul.ac.uk - Very Likely<br />
*Ali Taylan Cemgil - cemgil@science.uva.nl - Moderately Likely<br />
*Emilia Gomez - emilia.gomez@iua.upf.es - Likely<br />
*Masataka Goto - m.goto@aist.go.jp - Moderately Likely<br />
*Jana Eggink - j.eggink@dcs.shef.ac.uk - Moderately Likely<br />
*Anssi Klapuri - klap@cs.tut.fi - Moderately Likely<br />
*Matija Marolt - matija.marolt@fri.uni-lj.si - Moderately Likely<br />
*Rui Pedro Paiva - ruipedro@dei.uc.pt - Very Likely<br />
*Graham Poliner - graham@ee.columbia.edu - Very Likely<br />
*Sven Tappert - s_tappert@yahoo.de - Very Likely<br />
<br />
<br />
==Evaluation Procedures==<br />
<br />
Following the evaluation procedure specified for the ISMIR 2004 Melody Contest<br />
*Option 1 - A frame-based comparison between the predicted and reference melody<br />
The total prediction accuracy may be computed by calculating the average absolute difference for each frame, where a maximal error is defined as one semitone (100 cents) and a value of 0 Hz is assigned to unvoiced segments. (A sketch of this frame-based comparison is given after this list.)<br />
*Option 2 - A frame-based comparison between the predicted and reference melody over a one-octave range<br />
This option is the same as Option 1; however, the predicted melody and reference melody are mapped into the range of one octave before calculating the absolute difference.<br />
*Option 3 - Edit distance between the estimated melody and the correct melody<br />
Following the edit distance calculation outlined in Grachten et al. 2002<br />
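To make Options 1 and 2 concrete, here is a minimal sketch of a frame-based comparison in cents; it assumes both melodies are sampled on the same frame grid with 0 Hz marking unvoiced frames, and the exact per-frame scoring is one plausible interpretation of the wording above rather than the official metric.<br />
<pre>
import math

def hz_to_cents(f, ref=440.0):
    # frequency in Hz to cents relative to A4 = 440 Hz (the reference cancels
    # out because only differences in cents are used below)
    return 1200.0 * math.log(f / ref, 2)

def frame_accuracy(reference_hz, predicted_hz, fold_to_octave=False, max_error=100.0):
    # Average per-frame score; per-frame error is capped at one semitone
    # (100 cents). Unvoiced frames are marked with 0 Hz in both sequences.
    assert len(reference_hz) == len(predicted_hz)
    score = 0.0
    for ref, pred in zip(reference_hz, predicted_hz):
        if ref == 0.0 or pred == 0.0:
            score += 1.0 if ref == pred else 0.0   # both unvoiced -> full credit
            continue
        diff = abs(hz_to_cents(pred) - hz_to_cents(ref))
        if fold_to_octave:                         # Option 2: ignore octave errors
            diff = diff % 1200.0
            diff = min(diff, 1200.0 - diff)
        score += 1.0 - min(diff, max_error) / max_error
    return score / len(reference_hz)

# e.g. frame_accuracy([220.0, 0.0, 440.0], [221.0, 0.0, 880.0], fold_to_octave=True)
</pre>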
<br />
<br />
==Relevant Test Collections==<br />
<br />
For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions. Due to the success of the ISMIR 2004 Melody Competition, we recommend that the evaluation set be reused and augmented with additional audio excerpts from such genres as pop, jazz, digital, and opera. The new ground truth may be created by manually correcting the output of current melody transcription algorithms. We may also wish to consider representing the genres in different proportions for the MIREX 2005 evaluation. <br />
The inclusion of popular music may result in additional copyright issues. Copyright law prohibits the universal or unlimited distribution of material on the web. However, if access to the media is limited to MIREX participants, this should be considered a fair use of the copyrighted materials.<br />
<br />
<br />
==Review 1==<br />
<br />
The problem is reasonably well defined and would be considered interesting in terms of current research.<br />
<br />
No mention of audio format/sampling rate; I will assume:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* mono<br />
* 30-second excerpts<br />
* files are named as "001.wav" to "999.wav"<br />
No mention of frame size or hop size; will these be the same as in the 2004 competition (frame size 2048, hop size 256)? Is this optimal? Would some participants prefer to use different sizes? Could the proposed evaluation metrics be modified to use absolute time indexes and a tolerance, and therefore be independent of framing?<br />
<br />
In the proposed evaluation metrics there is no mention of whether Option 1 and Option 2 will be averaged as they were last year, or of how Option 3 will be combined with them.<br />
Statistical significance of differences between submissions should be estimated.<br />
<br />
Re-use and augmentation of last year's database is fine; however, there is no mention of where new data will come from. Obviously the Magnatune database would be a good source, as it can also be distributed, though it may be best to distribute last year's database and hold back the new examples. How big should the new database be? 50 files? I assume there will be no trained submissions, or that they will be pre-trained, so a single pass over the data should be fine. There is also no mention of how many non-participating transcribers will produce the ground truth or how differences between transcriptions will be resolved. Given the IP status of the Magnatune database, distribution to transcribers should not be a problem.<br />
<br />
Given the high number of potential participants, I think we can be confident of sufficient participation to run the evaluation.<br />
<br />
Recommendation: Significant refinements to proposal and accept.<br />
<br />
==Review 2==<br />
<br />
This problem is well defined and very relevant to MIR.<br />
<br />
The mentioned possible participants are really working in the field. However, the participants marked as "very likely" are the same people who participated last year, while some key researchers in the field are modestly marked as "moderately likely". I believe that for this evaluation to be meaningful, the organizers should secure the participation of Masataka Goto (whose PreFEst algorithm is still the main reference for melody extraction), Matija Marolt, Jana Eggink (both of whom published relevant work last year) and Anssi Klapuri (who has an extensive research record on relevant issues). Also, apart from Ali Taylan Cemgil, some of the people working on more Bayesian-based approaches to relevant problems are not mentioned: Chris Raphael (Indiana U), Samer Abdallah (Queen Mary, London), Randall Leistikow (Stanford U), Kunio Kashino (NTT Japan). It could be very interesting to have them on board.<br />
<br />
Regarding evaluation procedures, this contest has the advantage of having a precedent during last year's exercise. I would make a few suggestions from that experience:<br />
* UPF should make available any semi-automatic tool for evaluation used last year.<br />
* Each sound file to be used should be cross-annotated, and the variability between annotations should be used for the evaluation.<br />
* Arrangements with two or more melodic voices should be eliminated from the training/test set; in those there is no clear definition of the melody to be extracted.<br />
* There should be a separate evaluation for melody segmentation: how well the algorithm separates excerpts containing melodic parts from those that are purely background. The evaluation can be similar to the one in Marolt's DAFx-04 paper.<br />
I would recommend that the organizers contact Emilia Gomez, Sebastian Strecht and Bee-Suan Ong from UPF about last year's experience. We should learn from that experience and improve where necessary.<br />
<br />
Using the RWC database, Magnatune and other similar collections could help to expand the training and test sets. The organizers will need to coordinate a wide effort to expand on the currently existing contest database. Melody annotation is very complex and quite time-consuming, so only through a concerted effort will a proper test set be developed.<br />
The organizers could also contact Michele Lessaffre in Ghent about their annotation efforts in the past (see ISMIR 2004).</div>138.37.33.58https://music-ir.org/mirex/w/index.php?title=2005:Audio_Artist&diff=822005:Audio Artist2005-02-01T20:33:48Z<p>138.37.33.58: /* Review 2 */</p>
<hr />
<div>==Proposer==<br />
<br />
Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk<br />
<br />
<br />
==Title==<br />
<br />
Artist or group identification from musical audio.<br />
<br />
<br />
==Description==<br />
<br />
The automatic artist identification of musical audio.<br />
<br />
1) Input data<br />
The input for this task is a set of sound file excerpts adhering to the format, meta data and content requirements mentioned below.<br />
<br />
Audio format:<br />
* CD-quality (PCM, 16-bit, 44100 Hz)<br />
* single channel (mono)<br />
* Either whole files or 1 minute excerpts<br />
<br />
Audio content:<br />
* Any type of music<br />
* data set should include at least 25 different artists or groups working in any genre<br />
* both live performances and sequenced music are eligible<br />
* Each artist should be represented by a minimum of 10 examples. If possible the same number of examples should represent each artist.<br />
* If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.<br />
* It would be good to enforce some sort of cross-album component for the actual contest, to avoid detecting the producer or album rather than the artist<br />
<br />
Metadata:<br />
* By definition each example must have an artist or group label corresponding to one of the output classes.<br />
* It is assumed that artist labels will be correct; however, where possible, existing artist labels should be confirmed by two or more non-entrants. Due to IP constraints it is unlikely that we will be allowed to distribute any database for metadata validation by participants. This validation should ensure that each artist or group has a single label which is applied to all of their examples, and that any conflicts, such as an artist also belonging to a group also represented within the data, are resolved/removed for simplicity. Other possibilities include allowing multiple artist labels and requiring submissions to identify each label, with the final score divided evenly among the labels (I doubt there is demand for this).<br />
* The training set should be defined by a text file with one entry per line, in the following format:<br />
<example path and filename>\t<artist label>\n<br />
<br />
2) Output results<br />
Results should be output into a text file with one entry per line in the following format:<br />
<example path and filename>\t<artist classification>\n<br />
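A minimal sketch of reading the training-set definition and writing results in these tab-separated formats is given below; it is an illustration only, and the file paths are hypothetical.<br />
<pre>
# Illustrative sketch: read the tab-separated training-set definition and write
# a results file in the output format described above. Paths are hypothetical.
def read_training_list(path):
    pairs = []                              # list of (example_path, artist_label)
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            example_path, label = line.split("\t")
            pairs.append((example_path, label))
    return pairs

def write_results(path, predictions):       # predictions: (example_path, artist)
    with open(path, "w") as f:
        for example_path, label in predictions:
            f.write(example_path + "\t" + label + "\n")

# e.g.
# train = read_training_list("trainset.txt")
# write_results("results.txt", [("audio/track01.wav", "Artist A")])
</pre>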
<br />
<br />
==Potential Participants==<br />
<br />
* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, Medium<br />
* Elias Pampalk (ÖFAI), elias@oefai.at, Medium<br />
* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, Medium<br />
* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High<br />
* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, Medium<br />
* Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium<br />
* François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium<br />
<br />
<br />
==Evaluation Procedures==<br />
3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.<br />
<br />
Evaluation measures:<br />
* Simple accuracy and standard deviation of results (in the event of uneven class sizes, both should be normalised according to class size).<br />
* Test the significance of differences in the error rates of each pair of systems at each iteration using McNemar's test, reporting the mean and standard deviation of the P-values (a sketch of this test is given after this list).<br />
* Perhaps specify different class #s (1-in-10, 1-in-50, 1-in-1000) to test scaling and robustness among different implementations<br />
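As an illustration of the proposed significance testing (the exact formulation the framework will use is not specified here), the sketch below computes McNemar's continuity-corrected chi-squared statistic from the per-example correctness of two systems on the same test fold.<br />
<pre>
# Illustrative sketch of McNemar's test between two classifiers evaluated on the
# same examples. Inputs are boolean lists: True where the system classified the
# example correctly. Returns the continuity-corrected chi-squared statistic;
# values above about 3.84 indicate a significant difference at the 5% level
# (1 degree of freedom). A P-value can be obtained from the chi-squared
# distribution, e.g. scipy.stats.chi2.sf(statistic, 1).
def mcnemar_statistic(correct_a, correct_b):
    assert len(correct_a) == len(correct_b)
    b = sum(1 for ca, cb in zip(correct_a, correct_b) if ca and not cb)
    c = sum(1 for ca, cb in zip(correct_a, correct_b) if cb and not ca)
    if b + c == 0:
        return 0.0                     # the two systems never disagree
    return (abs(b - c) - 1.0) ** 2 / (b + c)

# e.g. mcnemar_statistic([True, True, False, True], [True, False, False, False])
</pre>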
<br />
Evaluation framework:<br />
<br />
The competition framework will be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), which will allow submission of contributions in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++ using the external code integration services provided in M2K. Submissions will be required to read in training-set definitions from a text file in the format specified in section 1 and to output results in the format described in section 2 above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results, and perform McNemar's testing of differences between the error rates of each system. An example framework could be made available in early February for submission development.<br />
<br />
<br />
==Relevant Test Collections==<br />
<br />
(Note potentially significant data overlap between this task and genre classification competition)<br />
Re-use Magnatune database (???)<br />
Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with creative commons)<br />
Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)<br />
Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites<br />
<br />
Ground truth annotations:<br />
<br />
All annotations should be validated, to ensure homogeneity of artist labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification decided by a majority vote. Any particularly contentious classifications could be removed.<br />
<br />
==Review 1==<br />
<br />
==Review 2==<br />
<br />
This proposal is very interesting and it is one of the most well defined. Indeed, it seems quite straightforward to establish the ground truth and to evaluate the results.<br />
<br />
The mentioned participants really belong to the field. People working on voice separation could be added, such as Feng, Zhuang & Pan and Tsai & Wang.<br />
<br />
The test data are also relevant and seem easy to obtain. The RWC database could also provide some data. However, I don't think that data synthesized from MIDI should be used (to avoid "MIDI-producer" detection).<br />
<br />
My main concern is about the range of genres spanned by the data. Indeed, if most data come from different genres, the problem becomes far easier and less relevant. I believe that artist identification and artist similarity (which is close to genre classification) are very different queries, and that artist identification is relevant only within a given genre.<br />
Thus I would like to perform the evaluation on one of two sets of artists belonging to a single genre (say classical or rock) and containing some very similar artists (say Mozart/Haydn/Gluck or The Beatles/The Rolling Stones/The Who).</div>138.37.33.58