MIREX Wiki - User contributions [en]

2005:Main Page

2005-02-02T20:04:26Z

128.174.154.95:

==Welcome to the MIREX Wiki.==

* MIREX 2005 (2nd Annual Music Information Retrieval Evaluation eXchange): https://www.music-ir.org/evaluation/MIREX/index.html
* Call for evaluation topic: https://www.music-ir.org/evaluation/MIREX/call_for_evaluation_topics.html
* Call for test data and evaluation procedures: https://www.music-ir.org/evaluation/MIREX/call_for_data_and_procedures.html

==Topics==

* [[Audio Artist Identification]]
* [[Audio Drum Detection]]
* [[Audio Genre Classification]]
* [[Audio Key Finding]]
* [[Audio Melody Extraction]]
* [[Audio Onset Detection]]
* [[Audio Tempo Extraction]]
* [[Symbolic Genre Classification]]
* [[Symbolic Key Finding]]
* [[Symbolic Melodic Similarity]]

==Editing Resources==

Please see:

* MediaWiki: [http://meta.wikipedia.org/wiki/MediaWiki_User%27s_Guide User's Guide]
* MediaWiki: [http://www.wikipedia.org/wiki/Help:Editing Editing Help]

==Other External Links==

*M2K: https://music-ir.org/evaluation/m2k/index.html
*M2K modules webpage: https://music-ir.org/evaluation/m2k/module_listing.html
*M2K Modules Wiki: https://www.music-ir.org/modules
*The Tools We Use: https://music-ir.org/evaluation/tools.html
*IMIRSEL: https://music-ir.org/evaluation/
*Music-IR Bibliography: https://music-ir.org/research_home.html
*Music-IR.org: https://music-ir.org/

2005:Audio Onset Detect

2005-02-01T21:55:42Z

128.174.154.95:

==Proposer==

Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk

==Title==

Onset Detection Contest

==Description==

The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.

1) Input data
Audio format:
The data will be monophonic sound files, with the associated onset times and data about the annotation robustness.
* CD-quality (PCM, 16-bit, 44100 Hz)
* single channel (mono)
* file length is not critical for this task, but excerpts of at most 30 seconds would be convenient if we want adequate diversity in the dataset. Keep in mind that real-world sounds must be manually annotated (a painful and time-consuming task, as pointed out by J. Bello at MIREX 2004).

Audio content:
The dataset will be subdivided into classes. This idea was raised by D. Ellis at the last MIREX. The reasons are:
* onset detection is performed in various applications, some of which are dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...)
* the composition of the entire database can determine the relative rank of the onset detection algorithms. For example, an evaluation on a dataset principally composed of complex mixes will not reward an onset detector that performs well on solo phrases of bowed strings but slightly worse than the others on complex mixes.
* it can show the weak points of the compared methods. I think this is more useful than an evaluation based on an overall success percentage or curve.
Suggestions for such classes:
We can define 2 types of subdivisions:
* monophonic instruments solo phrases
* polyphonic instruments solo phrases
* complex mixes
Or, as suggested by Bello et al.:
* pitched percussive instruments phrases
* pitched non-percussive instruments phrases
* non-pitched percussive instruments phrases
* complex mixes

Meta data:
Two types of annotation can be provided:
* Manual annotation for real-world sounds. For this type of annotation, our article mentions these potential difficulties:
* MIDI scores for synthesized sounds or MIDI-controlled instruments. These are considered robust ground truth.

Notes on annotation:
As mentioned above, the sound files will be provided with their onset time annotation. The ground-truth we will define can be critical for the evaluation.
For MIDI-controlled instruments, care should be taken to synchronize the MIDI clock and the audio recording clock.
For real world sounds, annotation volunteers are needed. The annotations should be cross-validated (errare humanum est). Precise instructions on which events to annotate must be given to the annotators.
Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). For these, the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are nearly impossible to annotate precisely: legato bowed-string phrases, which become even more difficult with added reverb. Slightly broken chords also introduce ambiguity about the number of onsets to mark. In these cases the annotations can be widely spread, and the annotation precision must be taken into account in the evaluation. How the annotation is taken into account must be precisely defined. My opinion is to discard sound events that are not musical notes (for example breathing and key strokes), which are quite frequent in solo recordings, even if they are detected by most onset detection algorithms.

Article and Matlab tool for annotation by Pierre Leveau et al.

http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf

http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm

2) Output data
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>_onsets.txt.

==Potential Participants==

* Tampere University of Technology, Audio Research Group
Anssi Klapuri <klap@cs.tut.fi>
* MIT, MediaLab
Tristan Jehan <tristan@medialab.mit.edu>
* LAM, France
Pierre Leveau <leveau@lam.jussieu.fr>
Laurent Daudet <daudet@lam.jussieu.fr>
* IRCAM, France
Xavier Rodet <rod@ircam.fr>
* Pompeu Fabra University, Multimedia Technology Group
Julien Ricard <jricard@iua.upf.es>
Fabien Gouyon <fgouyon@iua.upf.es>
* Queen Mary College, Centre for Digital Music
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk>
Paul Brossier <paul.brossier@elec.qmul.ac.uk>

==Evaluation Procedures==

The detected onset times will be compared with the ground-truth ones. A detected onset is counted as a correct detection if it falls within a tolerance time window around a ground-truth onset. If not, it is a false positive.
Evaluation measures:
* percentage of correct detections / false positives (can also be expressed as precision/recall)
* time precision (tolerance of 50 ms or less). For certain files, we cannot be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.
* separate scoring for different instrument types (percussive, strings, winds)
More detailed data:
* percentage of doubled detections
* speed measurements of the algorithms
* scalability to large files
* robustness to noise, loudness
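
The matching rule described above can be sketched as follows. This is only an illustration under stated assumptions, not an agreed scoring procedure: the function name `evaluate_onsets`, the one-to-one greedy matching of sorted onset lists, and the default 50 ms tolerance are choices made for the example.

```python
def evaluate_onsets(detected, reference, tolerance=0.05):
    """Compare detected onset times with ground-truth onset times (seconds).

    Assumption: each ground-truth onset can validate at most one detection,
    so a second detection near the same onset counts as a false positive.
    """
    detected = sorted(detected)
    reference = sorted(reference)
    correct = 0
    i = j = 0
    while i < len(detected) and j < len(reference):
        diff = detected[i] - reference[j]
        if abs(diff) <= tolerance:
            # Detection falls inside the tolerance window: correct detection.
            correct += 1
            i += 1
            j += 1
        elif diff < 0:
            # Detection precedes the window: false positive.
            i += 1
        else:
            # Ground-truth onset was not detected: false negative.
            j += 1
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(reference) if reference else 0.0
    return {
        "correct": correct,
        "false_positives": len(detected) - correct,
        "false_negatives": len(reference) - correct,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    scores = evaluate_onsets([0.51, 1.02, 1.04, 2.50], [0.50, 1.00, 2.00])
    print(scores)
```

Running the same function on each sub-dataset separately would give the per-class scores suggested in the proposal; a narrower `tolerance` could be used for files whose annotations are known to be precise.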

==Relevant Test Collections==

Possible sources: excerpts of the RWC Database, recordings made in the labs (MIDI generated or human), the upcoming FreeSound database, etc.
Some of them have already been cross-annotated. It would be good if everyone who owns an already-annotated onset database described its contents (source of the annotation: MIDI, how many human subjects, etc.). This would give an overview of how many onsets we already have and where they come from.

==Review 1==

Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to the different potential applications.

In order to decide which algorithm is the winner a single number should be finally extracted. A possibility to do so is tuning the algorithms to a single working point on the ROC curve, e.g. say allow a difference between FP and FN of less than 1%.
The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.

It does not mention whether there will be training data available to participants.
To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.

I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.

==Review 2==

Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.

The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.

The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.

The evaluation procedures are not clear to me. The current proposal is quite verbose; I would suggest that the author reduce the length of the proposal and make it more assertive.
There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise.
I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants.
For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms.
You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database could already be of great interest to the community.
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.

Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I would also suggest keeping the use of MIDI sounds to the minimum possible.
Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context in which they appear: onsets in pitched percussive music (e.g. piano and guitar music), onsets in pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets in non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.

2005:Audio Melody Extr

2005-02-01T21:55:21Z

128.174.154.95:

==Proposer==

Graham Poliner (Columbia University) graham@ee.columbia.edu

==Title==

Melody Extraction of Polyphonic Audio

==Description==

The melodic content of polyphonic audio provides an intuitive representation for summarization and retrieval. Numerous potential approaches exist for automated melody extraction; therefore, the MIREX 2005 Melody Extraction Evaluation seeks to compare the accuracy of state-of-the-art melody transcription algorithms. The evaluation data set will consist of an eclectic collection of audio excerpts along with the corresponding frame-based transcription of the dominant voice. The performance of the submitted algorithms will be evaluated based on the percentage of frames correctly transcribed.

==Potential Participants==

*Juan P. Bello - juan.bello-correa@elec.qmul.ac.uk - Very Likely
*Ali Taylan Cemgil - cemgil@science.uva.nl - Moderately Likely
*Emilia Gomez - emilia.gomez@iua.upf.es - Likely
*Masataka Goto - m.goto@aist.go.jp - Moderately Likely
*Jana Eggink - j.eggink@dcs.shef.ac.uk - Moderately Likely
*Anssi Klapuri - klap@cs.tut.fi - Moderately Likely
*Matija Marolt - matija.marolt@fri.uni-lj.si - Moderately Likely
*Rui Pedro Paiva - ruipedro@dei.uc.pt - Very Likely
*Graham Poliner - graham@ee.columbia.edu - Very Likely
*Sven Tappert - s_tappert@yahoo.de - Very Likely

==Evaluation Procedures==

Following the evaluation procedure specified for the ISMIR 2004 Melody Contest:
*Option 1 - A frame-based comparison between the predicted and reference melody
The total prediction accuracy may be computed by calculating the average absolute difference for each frame where a maximal error is defined as one semitone = 100 cents and a value of 0 Hz may be assigned to unvoiced segments.
*Option 2 - A frame-based comparison between the predicted and reference melody over a one-octave range
This option is the same as Option 1; however, the predicted melody and reference melody are mapped into the range of one octave before calculating the absolute difference.
*Option 3 - Edit distance between the estimated melody and the correct melody
Following the edit distance calculation outlined in Grachten et al. 2002
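
Options 1 and 2 can be illustrated with a short sketch. It is not the ISMIR 2004 reference implementation: the function names, the treatment of a voiced/unvoiced mismatch as a maximal error, and the conversion of the mean error into a score between 0 and 1 are assumptions made for the example. Option 3 is not covered.

```python
import math

MAX_ERROR_CENTS = 100.0  # one semitone, the maximal per-frame error


def frame_error(pred_hz, ref_hz, fold_octave=False):
    """Absolute pitch error of one frame in cents, capped at one semitone.

    A value of 0 Hz marks an unvoiced frame. Assumption: a frame that is
    voiced in only one of the two melodies receives the maximal error.
    """
    if pred_hz == 0 and ref_hz == 0:
        return 0.0
    if pred_hz == 0 or ref_hz == 0:
        return MAX_ERROR_CENTS
    diff = abs(1200.0 * math.log2(pred_hz / ref_hz))
    if fold_octave:
        # Option 2: map the difference into one octave, so that octave
        # errors are not penalised.
        diff = diff % 1200.0
        diff = min(diff, 1200.0 - diff)
    return min(diff, MAX_ERROR_CENTS)


def melody_accuracy(pred, ref, fold_octave=False):
    """Score in [0, 1]: 1 minus the mean capped error over all frames.

    Assumption: both sequences use the same frame grid and length.
    """
    if len(pred) != len(ref) or not ref:
        raise ValueError("sequences must be non-empty and equally long")
    errors = [frame_error(p, r, fold_octave) for p, r in zip(pred, ref)]
    return 1.0 - (sum(errors) / len(errors)) / MAX_ERROR_CENTS


if __name__ == "__main__":
    predicted = [220.0, 440.0, 0.0, 880.0]
    reference = [220.0, 445.0, 0.0, 440.0]
    print(melody_accuracy(predicted, reference))
    print(melody_accuracy(predicted, reference, fold_octave=True))
```

In this example the last frame is an octave error: it receives the maximal error under Option 1 and no error under Option 2.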

==Relevant Test Collections==

For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions. Due to the success of the ISMIR 2004 Melody Competition, we recommend that the evaluation set be reused and augmented with additional audio excerpts from such genres as pop, jazz, digital, and opera. The new ground truth may be created by manually correcting the output of current melody transcription algorithms. We may also wish to consider representing the genres in different proportions for the MIREX 2005 evaluation.
The inclusion of popular music may result in additional copyright issues. Copyright law prohibits the universal or unlimited distribution of material on the web. However, if access to the media is limited to MIREX participants, this should be considered a fair use of the copyrighted materials.

==Review 1==

Problem is reasonably well defined and would be considered interesting in terms of current research.

No mention of audio format/sampling rate, will assume:
* CD-quality (PCM, 16-bit, 44100 Hz)
* mono
* 30-second excerpts
* files are named as "001.wav" to "999.wav"
No mention of frame size or hop size; will this be the same as the 2004 competition (frame size 2048, hop size 256)? Is this optimal? Would some participants prefer to use different sizes? Could the proposed evaluation metrics be modified to use absolute time indexes and a tolerance and therefore be independent of framing?

In the proposed evaluation metrics there is no mention of whether Option 1 and Option 2 will be averaged as they were last year, or how Option 3 will be combined with these.
Statistical significance of differences between submissions should be estimated.

Re-use and augmentation of last year's database is fine; however, there is no mention of where new data will come from. Obviously the Magnatune database would be a good source, as this can also be distributed; however, it may be best to distribute last year's database and hold back new examples. How big should the new database be? 50 files? I assume there are likely to be no trained submissions, or they will be pre-trained, therefore a single pass over the data should be fine. There is also no mention of how many non-participating transcribers will produce the ground truth and how differences in transcriptions will be resolved. Given the IP status of the Magnatune database, distribution to transcribers should not be a problem.

Given the high number of potential participants, I think we can be confident of sufficient participation to run the evaluation.

Recommendation: Significant refinements to proposal and accept.

==Review 2==

This problem is well defined and very relevant to MIR.

The mentioned possible participants are really working in the field. However, the participants marked as "very likely" are the same people that participated last year, while some key researchers in the field are modestly marked as "moderately likely". I believe that for this evaluation to be meaningful, the organizers should secure the participation of Masataka Goto (whose PreFEst algorithm is still the main reference for melody extraction), Matija Marolt, Jana Eggink (both of whom published relevant work last year) and Anssi Klapuri (who has an extensive research record on relevant issues). Also, apart from Ali Taylan Cemgil, some of the people working on more Bayesian-based approaches to relevant problems are not mentioned: Chris Raphael (Indiana U), Samer Abdallah (Queen Mary, London), Randall Leistikow (Stanford U), Kunio Kashino (NTT Japan). It could be very interesting to have them on board.

Regarding evaluation procedures, this contest has the advantage of having a precedent during last year's exercise. I would make a few suggestions from that experience:
* UPF should make available any semi-automatic tool for evaluation used last year.
* Each sound file to be used, should be cross-annotated, and the variability between annotations should be used for the evaluation.
* 2 or more voice arrangements should be eliminated from the training/test set. In those there is no clear definition of the melody to be extracted.
* There should be a separate evaluation for melody segmentation: how well the algorithm separates those excerpts containing melodic parts from those that are purely background. The evaluation can be similar to the one in Marolt's paper for DAFx04.
I would recommend the organizers to contact Emilia Gomez, Sebastian Strecht and Bee-Suan Ong from UPF, about last year's experience. We should learn from that experience and improve where necessary.

Using the RWC database, Magnatune and other similar collections could help to expand the training and test sets. The organizers will need to coordinate a wide effort to expand on the currently existing contest database. Melody annotation is very complex and quite time-consuming, so only through a concerted effort will a proper test set be developed.
The organizers could also contact Michele Lessaffre in Ghent about their annotation efforts in the past (see ISMIR 2004).

2005:Audio Key Finding

2005-02-01T21:55:07Z

128.174.154.95:

==Proposer==

Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) mardiros@usc.edu

==Title==

Evaluation of Key Finding Algorithms

==Description==

Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose the evaluation of key-finding algorithms at the 2005 MIREX.

There are significant contributions in the area of key finding for both audio and symbolic representation. Thus the same contest was also proposed for MIDI data. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.

==Potential Participants==

* Emilia Gómez (egomez@iua.upf.es) and Perfecto Herrera (perfecto.herrera@iua.upf.es): [high].
* Steffen Pauws (steffen.pauws@philips.com): [high].
* Ching-Hua Chuan (chinghuc@usc.edu) and Elaine Chew (echew@usc.edu): [high].
* Ozgur Izmirli (oizm@conncoll.edu): [moderate].
* Yongwei Zhu (ywzhu@i2r.a-start.edu.sg) and Mohan Kankanhalli (mohan@comp.nus.edu.sg): [unknown].

==Evaluation Procedures==

The following evaluation outline is a general guideline that will be compatible with both audio and symbolic key finding algorithms. It is safe to assume that each key finding algorithm will have its own set of parameters. The creators of the system should pre-determine the optimal settings for the parameters. Once these settings are determined, an accuracy rate may be calculated. The input of the test should be some excerpt of the pieces in the test set and the output will be the key name, for example, C major or E flat minor. We plan to use pieces for which the keys are known, for example, symphonies and concertos by well-known composers where the keys are stated in the title of the piece. The excerpt will typically be the beginnings of the pieces as this is the only part of the piece for which establishing of the global and known key can be guaranteed.

The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. We will then determine how 'close' each identified key is to the corresponding correct key. Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor. It can be assumed that an algorithm that returns a key closely related to the actual key is superior to one that returns an unrelated key. We may then use this information to generate further metrics.
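
The three 'close' relationships can be made concrete with a small sketch. The key representation (a pitch class from 0 to 11 with C = 0, plus a mode string) and the function name `key_relation` are assumptions made for the example, and the perfect fifth is counted in either direction.

```python
def key_relation(estimated, actual):
    """Classify the relationship between an estimated key and the actual key.

    Each key is a tuple (tonic, mode): tonic is a pitch class 0-11 with
    C = 0, and mode is "major" or "minor" (an assumed representation).
    """
    est_tonic, est_mode = estimated
    act_tonic, act_mode = actual
    interval = (est_tonic - act_tonic) % 12  # semitones above the actual tonic

    if est_tonic == act_tonic and est_mode == act_mode:
        return "correct"
    if est_mode == act_mode and interval in (5, 7):
        # Same mode, tonic a perfect fifth above or below.
        return "perfect fifth"
    if est_tonic == act_tonic:
        # Same tonic, opposite mode (e.g. C major and C minor).
        return "parallel"
    if act_mode == "major" and est_mode == "minor" and interval == 9:
        # Relative minor lies a major sixth above the major tonic.
        return "relative"
    if act_mode == "minor" and est_mode == "major" and interval == 3:
        # Relative major lies a minor third above the minor tonic.
        return "relative"
    return "unrelated"


if __name__ == "__main__":
    c_major = (0, "major")
    print(key_relation((7, "major"), c_major))   # G major vs C major
    print(key_relation((9, "minor"), c_major))   # A minor vs C major
    print(key_relation((0, "minor"), c_major))   # C minor vs C major
```

Counting how often each label occurs over the test set would give the further metrics mentioned above, in addition to the plain accuracy rate.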

Clearly, the optimal parameters may vary for different styles of music, and by composer. If time permits and the systems allow, we may next focus on pieces for which the algorithm has identified an incorrect key under the optimal settings of the parameters and determine whether the incorrect assignments were due to improper parameter selection. We may then calculate the percent of the pieces that had an incorrect assignment under the optimal settings but have a correct assignment with other settings.

==Relevant Test Collections==

Audio data can be obtained from HNH Hong Kong International, Ltd. (http://www.naxos.com), if the agreement with the company is now in effect for MIR testing. We have determined that only fifteen to thirty second excerpts may be sufficient for key finding using audio data. Copyright regulations state that up to 33% of audio files may be copied without any violations of such regulations. This is advantageous since fifteen to thirty second excerpts will be well within this limit.

==Review 1==

The proposals contemplate two different evaluations for key estimation: one for MIDI and another one for Audio Data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared, by having a test collection including Audio Data and its MIDI representation, or a MIDI representation and the Audio generated by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI & Audio.

Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to some other styles? For instance, popular music whose key is known.

Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor". What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442,...). Keys should be also considered as 'close' if they have a relationship of "1 semitone", to consider this difference between real key (according to its tuning) & labelled key (A major). In the case of MIDI, this problem does not appear.

Will there be some training data, so that participants can try their algorithms?

I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be: Hendrik Purwins.

==Review 2==

General comments:
Title: Evaluation of Key Finding Algorithms Using Audio Data or Evaluation of Key Finding Algorithms Part 1
Description Paragraph: Par 2, Line 2 - sentence requires correction

The problem is well defined and the mentioned possible participants seem likely to participate.

Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)
Assumption of closeness:
* Perfect 5th: Is this generally accepted as an almost similar key?
* Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field)
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?

The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?

2005:Audio Genre

2005-02-01T21:54:51Z

128.174.154.95:

==Proposer==

Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk

==Title==

Genre Classification from polyphonic audio.

==Description==

The automatic classification of polyphonic musical audio (in PCM format) into a single high-level genre per example. If there is sufficient demand, a multiple genre track could be defined, requiring submissions to identify each genre (without prior knowledge of the number of labels), with the precision and recall scores calculated for each result.

1) Input data
The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.

Audio format:
* CD-quality (PCM, 16-bit, 44100 Hz)
* single channel (mono)
* Either whole files or 1 minute excerpts

Audio content:
* polyphonic music
* data set should include at least 8 different genres (suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Ballroom Dance, Electronic/Modern Dance, Classical, Folk; "World" music should be excluded, as it is a common "catch-all" for ethnic/folk music that is not easily classified into another group and can contain music as diverse as Indian tabla and Celtic rock)
* the classification could also be evaluated in two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Pop/Rock), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).
* both live performances and sequenced music are eligible
* Each class should be represented by a minimum of 100 examples, but 150 would be preferred. If possible the same number of examples should represent each class.
* If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.

Metadata:
* By definition each example must have a genre label corresponding to one of the output classes.
* Where possible, existing genre labels should be confirmed by two or more non-entrants. Due to IP constraints, it is unlikely that we will be allowed to distribute any database for metadata validation by participants.
* The training set should be defined by a text file with one entry per line, in the following format:
<example path and filename>\t<genre label>\n

2) Output results
Results should be output into a text file with one entry per line in the following format:

<example path and filename>\t<genre classification>\n
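
Both files use the same tab-separated layout, so reading a training set definition and writing a results file can be sketched as below. The function names and the example paths are hypothetical, and the sketch assumes that paths contain no tab characters.

```python
def parse_label_lines(lines):
    """Parse lines of the form '<path>\\t<genre label>\\n' into tuples."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        path, label = line.split("\t", 1)
        entries.append((path, label))
    return entries


def format_result_lines(results):
    """Format (path, genre) pairs as '<path>\\t<genre classification>\\n'."""
    return "".join(f"{path}\t{genre}\n" for path, genre in results)


if __name__ == "__main__":
    # Hypothetical training set definition, as it would appear in the file.
    training_text = "data/001.wav\tRock\ndata/002.wav\tClassical\n"
    training = parse_label_lines(training_text.splitlines(keepends=True))
    print(training)

    # Hypothetical classifier output written in the results format.
    print(format_result_lines([("data/003.wav", "Jazz/Blues")]), end="")
```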

==Potential Participants==

* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, High
* Elias Pampalk (ÖFAI), elias@oefai.at, High
* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, High
* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High
* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, High
* Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
* François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium

==Evaluation Procedures==
3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.
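
One way to build folds with an equal proportion of each class is to shuffle the examples of each class and deal them out in turn. This is a sketch under assumptions, not the D2K framework's own procedure: the function name `stratified_folds` and the fixed random seed are illustrative.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=3, seed=0):
    """Split example indices into folds with balanced class proportions.

    `labels` holds one genre label per example. Returns a list of
    `n_folds` lists of indices into `labels`.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the examples of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


if __name__ == "__main__":
    labels = ["Rock"] * 6 + ["Jazz/Blues"] * 6
    for test_fold in stratified_folds(labels, n_folds=3):
        print(sorted(test_fold))
```

In each iteration one fold serves as the test set and the remaining folds as the training set.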

Evaluation measures:
* Simple accuracy and standard deviation of results (in the event of uneven class sizes, both should be normalised according to class size).
* Test significance of differences in error rates of each system at each iteration using McNemar's test, mean average and standard deviation of P-values.
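
McNemar's test compares two systems on the same test examples by looking only at the examples on which they disagree. The sketch below uses the exact (binomial) form of the test; this variant, like the function name `mcnemar_exact`, is an assumption, since the proposal does not say which form is intended.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test for two classifiers on the same examples.

    `correct_a` and `correct_b` are equally long sequences of booleans
    saying whether each system classified each example correctly.
    Returns (b, c, p_value): b counts examples only A got right,
    c counts examples only B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    b = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if b_ok and not a_ok)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the systems never disagree
    # Under the null hypothesis each disagreement favours A or B with
    # probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


if __name__ == "__main__":
    system_a = [True, True, True, True, False, False]
    system_b = [True, False, False, False, False, True]
    print(mcnemar_exact(system_a, system_b))
```

Applying the function to each pair of systems in each cross-validation fold would give the per-iteration P-values whose mean and standard deviation are listed above.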

Evaluation framework:

Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++ using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified under "1) Input data" and output results in the format described under "2) Output results" above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results, and perform McNemar's testing of differences between the error rates of each system. An example framework could be made available in early February for submission development.

==Relevant Test Collections==

Re-use Magnatune database (???)
Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with creative commons)
Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites

Ground truth annotations:

All annotations should be validated, rather than accepting supplied genre labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification being decided by a majority vote. Any particularly contentious classifications could be removed.

==Review 1==

==Review 2==

The single genre problem is well defined and seems to be a relevant problem for the MIR community nowadays. Obviously, it would be more relevant to classify each track into multiple genres or to use a hierarchy of genres, but the proposal does not deal with these issues in a satisfying way. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are nodes for Electronic and Jazz/Blues, where does the leaf Electro-jazz lie?
I suggest that the contest concentrate on the well-defined simple genre problem. An interesting development of it would be to ask algorithms to associate a probability with each predefined genre on each track, instead of outputting a single genre with 100% probability.
Regarding the input format, I think that whole files are better (the total duration and the volume variation are already good genre descriptors) and that polyphony is not required (classical music contains many works for solo instruments).

I have no precise opinion regarding the defined genres, since this is more a matter of culture. I'm not sure that Rock is less diverse than World (what do Elvis and Radiohead have in common?). Also, I am surprised that there is no Rap/RnB.
The choice of genre classes is a crucial issue if the contest is to be held several times. Indeed, existing databases can be reused only when the defined categories are identical each year. Thus I would like this choice to be discussed further by the participants.

The list of participants is relevant. McKinney and Breebaart could be added.

It is a good idea to accept many programming languages for submission. However, it seems quite difficult to implement the learning phase, because each algorithm may use different structures to store learnt data. For instance, when the algorithm computes descriptors and feeds them through a classifier, is it possible to select the best descriptors? If not, it is not realistic to expect participants to do this beforehand on their own limited sets of data. I then see two possibilities: either participants are given 50% of the database and do all the learning work themselves (in which case no k-fold cross-validation is performed), or submissions concern only sets of descriptors and not full classification algorithms. The second choice has the advantage of allowing different sets of descriptors to be compared using the same classifiers.

The test data are relevant but still a bit vague. Obviously, existing databases should be used again and supplemented with new annotated data. The participants should list their own databases in detail and pool them for evaluation, so that the time needed to annotate new data can be estimated.

2005:Audio Artist

2005-02-01T21:54:26Z

128.174.154.95:

==Proposer==

Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk

==Title==

Artist or group identification from musical audio.

==Description==

The automatic artist identification of musical audio.

1) Input data
The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.

Audio format:
* CD-quality (PCM, 16-bit, 44100 Hz)
* single channel (mono)
* Either whole files or 1 minute excerpts

Audio content:
* Any type of music
* data set should include at least 25 different artists or groups working in any genre
* both live performances and sequenced music are eligible
* Each artist should be represented by a minimum of 10 examples. If possible the same number of examples should represent each artist.
* If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.
* It would be good to enforce some sort of cross-album component in the actual contest, to avoid producer detection
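
One way to read the cross-album requirement is that no album should contribute tracks to both the training and the test set. The sketch below assumes an album label is available for every track, which the contest format does not provide. The function name `album_disjoint_split` and the `(path, artist, album)` tuples are hypothetical, and holding out the alphabetically last album(s) is an arbitrary choice made only to keep the example deterministic.

```python
from collections import defaultdict


def album_disjoint_split(tracks, test_albums_per_artist=1):
    """Split (path, artist, album) tuples so that no album is in both sets.

    Returns (train, test), each a list of (path, artist) pairs.
    The album field is an assumption; the contest format has no such column.
    """
    albums = defaultdict(lambda: defaultdict(list))
    for path, artist, album in tracks:
        albums[artist][album].append(path)

    train, test = [], []
    for artist in sorted(albums):
        names = sorted(albums[artist])
        if len(names) <= test_albums_per_artist:
            raise ValueError(f"{artist}: not enough albums to hold one out")
        # Hold out the last album(s) in sorted order for testing.
        held_out = set(names[-test_albums_per_artist:])
        for album in names:
            target = test if album in held_out else train
            target.extend((path, artist) for path in albums[artist][album])
    return train, test
```

An artist with too few albums raises an error here; a real framework might instead drop such artists or fall back to a track-level split.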

Metadata:
* By definition each example must have an artist or group label corresponding to one of the output classes.
* It is assumed that artist labels will be correct; however, where possible, existing artist labels should be confirmed by two or more non-entrants. Due to IP constraints, it is unlikely that we will be allowed to distribute any database for metadata validation by participants. This validation should ensure that each artist or group has a single label which is applied to all of their examples, and that any conflicts, such as an artist also belonging to a group that is also represented within the data, are resolved or removed for simplicity. Other possibilities include allowing multiple artist labels and requiring submissions to identify each label, with the final score divided evenly among the labels (I doubt there is demand for this).
* The training set should be defined by a text file with one entry per line, in the following format:
<example path and filename>\t<artist label>\n

2) Output results
Results should be output into a text file with one entry per line in the following format:
<example path and filename>\t<artist classification>\n
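
Both files share the same shape: one tab-separated path and label per line. A minimal Python sketch of reading and writing that shape is given below. The function names `parse_list` and `format_list` are hypothetical, and rejecting lines that do not have exactly two fields is an assumption, since the proposal does not say how malformed lines should be handled.

```python
def parse_list(text):
    """Parse '<path>\\t<label>\\n' lines into a list of (path, label) pairs."""
    entries = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue  # skip blank lines, including the one after the final \n
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated fields, "
                f"got {len(fields)}"
            )
        entries.append((fields[0], fields[1]))
    return entries


def format_list(entries):
    """Serialise (path, label) pairs, one tab-separated entry per line."""
    return "".join(f"{path}\t{label}\n" for path, label in entries)
```

A submission would parse the training list with `parse_list` and write its predictions with `format_list`, keeping the example paths unchanged so that results can be matched to the ground truth.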

==Potential Participants==

* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, Medium
* Elias Pampalk (ÖFAI), elias@oefai.at, Medium
* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, Medium
* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High
* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, Medium
* Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
* François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium

==Evaluation Procedures==
3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.
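
As an illustration of the fold construction, the sketch below deals the examples of each class out to the folds in turn, so that every fold receives a near-equal share of each class. The names `stratified_folds` and `train_test_splits` are hypothetical, and the shuffling seed is an assumption; the actual framework may build its folds differently.

```python
import random
from collections import defaultdict


def stratified_folds(entries, k=3, seed=0):
    """Split (path, label) pairs into k folds with near-equal class shares."""
    by_label = defaultdict(list)
    for path, label in entries:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the folds are reproducible
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        paths = by_label[label]
        rng.shuffle(paths)
        # Deal this class's examples round-robin across the folds.
        for index, path in enumerate(paths):
            folds[index % k].append((path, label))
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [entry for j, fold in enumerate(folds) if j != i
                 for entry in fold]
        yield train, test
```

When a class size is not a multiple of k, the earlier folds receive the extra examples, so fold sizes can differ slightly.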

Evaluation measures:
* Simple accuracy and standard deviation of results (in the event of uneven class sizes, both should be normalised according to class size).
* Test the significance of differences between the error rates of the systems at each iteration using McNemar's test, and report the mean and standard deviation of the P-values.
* Perhaps specify different numbers of classes (1-in-10, 1-in-50, 1-in-1000) to test scaling and robustness among different implementations.
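
A possible reading of the first measure is sketched below: simple accuracy, plus a class-normalised accuracy taken as the mean of the per-class accuracies, so that large classes do not dominate. The function names are hypothetical, and using the sample standard deviation across folds is an assumption, since the proposal does not say which estimator is intended.

```python
from collections import Counter
from statistics import mean, stdev


def accuracy(truth, predicted):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def normalised_accuracy(truth, predicted):
    """Mean of per-class accuracies, so every class counts equally."""
    totals = Counter(truth)
    correct = Counter(t for t, p in zip(truth, predicted) if t == p)
    return mean(correct[label] / totals[label] for label in totals)


def summarise(fold_scores):
    """Mean and sample standard deviation of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)
```

For example, with true labels `["a", "a", "a", "b"]` and every example predicted as `"a"`, the simple accuracy is 0.75 but the normalised accuracy is 0.5, because class `"b"` is never recognised.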

Evaluation framework:

The competition framework will be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), and will allow submission of contributions in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++, using the external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank the results, and perform McNemar's test on the differences between the error rates of each system. An example framework could be made available in early February for submission development.
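
To make the significance test concrete, the sketch below computes a two-sided p-value for McNemar's test on the paired predictions of two systems over the same test fold. It uses the exact binomial form of the test, which is an assumption: the proposal names McNemar's test but not the variant, and the chi-squared approximation is a common alternative. The function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(truth, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two systems on the same examples.

    Only discordant examples matter: those where exactly one system is right.
    """
    a_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b)
                 if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b)
                 if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    k = min(a_only, b_only)
    # Probability of k or fewer successes in n fair coin flips, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

If system A is right and system B wrong on five examples, and the reverse never happens, the p-value is 2 × (1/32) = 0.0625.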

==Relevant Test Collections==

(Note the potentially significant data overlap between this task and the genre classification competition.)
* Re-use Magnatune database (???)
* Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with Creative Commons licences)
* Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
* Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites

Ground truth annotations:

All annotations should be validated, to ensure homogeneity of artist labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification decided by a majority vote. Any particularly contentious classifications could be removed.

==Review 1==

==Review 2==

This proposal is very interesting and it is one of the best defined. Indeed, it seems quite straightforward to establish the ground truth and to evaluate the results.

The mentioned participants really belong to the field. People working on voice separation could be added, such as Feng, Zhuang & Pan and Tsai & Wang.

The test data are also relevant and seem easy to obtain. The RWC database could also provide some data. However I don't think that data synthesized from MIDI can be used (to avoid the "MIDI-producer" detection).

My main concern is about the range of genres spanned by the data. Indeed, if most of the data come from different genres, the problem becomes far easier and less relevant. I believe that artist identification and artist similarity (which is close to genre classification) are very different queries, and that artist identification is relevant only within a given genre.
Thus I would like the evaluation to be performed on one or two sets of artists belonging to a single genre (say classical or rock) and containing some very similar artists (say Mozart/Haydn/Gluck or The Beatles/The Rolling Stones/The Who).