2005:Audio Artist

Revision as of 12:17, 3 February 2005

Proposer

Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk


Title

Artist or group identification from musical audio.


Description

Automatic artist or group identification from musical audio.

1) Input data

The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements described below.

Audio format:

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • Single channel (mono)
  • Either whole files or 1-minute excerpts (a conversion sketch follows this list)
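
For illustration, a minimal sketch of converting an arbitrary recording to this format, assuming Python with the librosa and soundfile libraries (neither is prescribed by the proposal); the file names are placeholders:

# Sketch: convert an input recording to the contest audio format
# (mono, 16-bit PCM WAV, 44100 Hz). librosa and soundfile are
# assumptions of this sketch; the file names are illustrative only.
import librosa
import soundfile as sf

SR = 44100  # contest sample rate

def to_contest_format(src_path, dst_path, excerpt_seconds=None):
    # librosa resamples to SR and downmixes to mono on load
    audio, _ = librosa.load(src_path, sr=SR, mono=True)
    if excerpt_seconds is not None:
        audio = audio[: int(excerpt_seconds * SR)]  # e.g. a 1-minute excerpt
    # write as 16-bit PCM WAV
    sf.write(dst_path, audio, SR, subtype="PCM_16")

if __name__ == "__main__":
    to_contest_format("input_track.mp3", "output_excerpt.wav", excerpt_seconds=60)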

Audio content:

  • Any type of music
  • The data set should include at least 25 different artists or groups, working in any genre
  • Both live performances and sequenced music are eligible
  • Each artist should be represented by a minimum of 10 examples; if possible, the same number of examples should represent each artist.
  • If possible, a subset of the data (20%) should be given to participants in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they are primarily intended for testing correct execution of algorithm submissions.
  • Some form of cross-album component should be enforced in the actual contest, to avoid "producer detection" (classifiers keying on album-level production characteristics rather than on the artist).

Metadata:

  • By definition each example must have an artist or group label corresponding to one of the output classes.
  • It is assumed that artist labels will be correct; however, where possible, existing artist labels should be confirmed by two or more non-entrants. Due to IP constraints it is unlikely that we will be allowed to distribute any database for metadata validation by participants. This validation should ensure that each artist or group has a single label which is applied to all of their examples, and that any conflicts, such as an artist also belonging to a group represented within the data, are resolved or removed for simplicity. Other possibilities include allowing multiple artist labels and requiring submissions to identify each label, with the final score divided evenly among the labels (I doubt there is demand for this).
  • The training set should be defined by a text file with one entry per line, in the following format:

<example path and filename>\t<artist label>\n

2) Output results

Results should be output to a text file with one entry per line, in the following format: <example path and filename>\t<artist classification>\n
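
For illustration, a minimal Python sketch of reading a training-set definition and writing results in these tab-separated formats; the file names and the dummy prediction are placeholders, not part of the proposal:

# Sketch: read a training-set definition file (one "<path>\t<artist label>"
# entry per line) and write predictions in the required result format.
# The file names and the dummy prediction are placeholders.

def read_list_file(path):
    """Return a list of (example_path, artist_label) pairs."""
    pairs = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            example_path, label = line.split("\t")
            pairs.append((example_path, label))
    return pairs

def write_result_file(path, predictions):
    """predictions: list of (example_path, predicted_artist) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for example_path, predicted in predictions:
            f.write(f"{example_path}\t{predicted}\n")

if __name__ == "__main__":
    train = read_list_file("train_list.txt")               # hypothetical file name
    test_paths = [p for p, _ in read_list_file("test_list.txt")]
    # the constant label stands in for whatever classifier a submission implements
    predictions = [(p, "unknown_artist") for p in test_paths]
    write_result_file("results.txt", predictions)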


Potential Participants

  • Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, Medium
  • Elias Pampalk (ÖFAI), elias@oefai.at, Medium
  • George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, Medium
  • Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High
  • Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, Medium
  • Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
  • François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium
  • Nicolas Scaringella (EPFL), nicolas.scaringella@epfl.ch, High

Evaluation Procedures

3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.
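
For illustration, a minimal sketch of such a stratified split, assuming Python and scikit-learn (an assumption of this sketch, not a requirement of the evaluation); the feature matrix and labels are placeholders. In practice the evaluation framework, not the participants, would perform this split.

# Sketch: 3-fold cross-validation with an (approximately) equal proportion
# of each artist class in every fold, using scikit-learn's StratifiedKFold.
# X and y are placeholder features and artist labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(250, 20)          # placeholder feature vectors
y = np.repeat(np.arange(25), 10)     # 25 artists x 10 examples each

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # each fold preserves the per-artist class proportions
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test examples")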

Evaluation measures:

  • Simple accuracy and standard deviation of results (in the event of uneven class sizes, both of these should be normalised according to class size).
  • Test the significance of differences in the error rates of each system at each iteration using McNemar's test; report the mean and standard deviation of the p-values (a sketch of both measures follows this list).
  • Perhaps specify different numbers of classes (1-in-10, 1-in-50, 1-in-1000) to test scaling and robustness among different implementations.
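
For illustration, a minimal Python sketch of both measures, assuming numpy and scipy; the prediction arrays would come from the submitted systems:

# Sketch of the two proposed measures: class-normalised accuracy and
# McNemar's test on the paired error patterns of two systems.
# numpy/scipy are assumptions of this sketch.
import numpy as np
from scipy.stats import chi2

def class_normalised_accuracy(y_true, y_pred):
    """Mean of per-class accuracies (guards against uneven class sizes)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

def mcnemar_p(y_true, pred_a, pred_b):
    """McNemar's test (with continuity correction) on two systems' errors."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_right, b_right = pred_a == y_true, pred_b == y_true
    n01 = int(np.sum(a_right & ~b_right))   # A correct, B wrong
    n10 = int(np.sum(~a_right & b_right))   # A wrong, B correct
    if n01 + n10 == 0:
        return 1.0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return float(chi2.sf(stat, df=1))

The p-value compares the paired error patterns of two systems on the same fold; the mean and standard deviation of these values over folds would then be reported.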

Evaluation framework:

The competition framework will be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), which will allow submission of contributions both in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005) and in Matlab, Python and C++ using the external code integration services provided in M2K. Submissions will be required to read in training-set definitions from a text file in the format specified in 1) above and to output results in the format described in 2) above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results, and perform McNemar's testing of the differences between the error rates of each system. An example framework could be made available in early February for submission development.


Relevant Test Collections

(Note: there is potentially significant data overlap between this task and the genre classification competition.)

  • Re-use of the Magnatune database (???)
  • Individual contributions of copyright-free recordings (including white-label vinyl and music databases under Creative Commons licences)
  • Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
  • Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites

Ground truth annotations:

All annotations should be validated, to ensure homogeneity of artist labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification being decided by a majority vote. Any particularly contentious classifications could be removed.


Review 1

The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples. Both tests follow on from successful evaluations held at ISMIR2004; interest was good and the results were interesting, and I think we can expect good participation in 2005.

The tasks are well-defined, easily understood, and appear to have some practical importance. The evaluation and testing procedures are very good. This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.

My only comments relate to the choice of ground truth data. For the artist ID task I think we should use real, commercial recordings, since there is no shortage of them, and the artist ground truth is easily defined. I do not think it is important to have independent verification of the ground truth, since there will be enough examples to ensure that a few questionable cases won't much hurt the overall performance, and in any case all we really care about is comparative performance. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution. If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders. (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)

My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.

In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results. In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete. Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat.

Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions. However, having each site run their own classifiers is a huge win in terms of the logistics of running the test. I would, however, discourage any scheme that involves releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set if the test-set labels are just lying around.

I am particularly interested in a set of contrast conditions for different scales of problem - 1 in 10, 1 in 50, 1 in 100 etc. Most artist ID tasks have been on very small subsets of 'all possible artists', and it would be interesting to see if there are differences in how different approaches scale (e.g. that only some techniques are tractable for very large sets).

I also think that a cross-album condition is particularly interesting. Again, this could be a contrast: for each artist, have training data from albums A and B, then have (disjoint) test data from albums B and C, and compare the accuracy in the two cases to see how strong the 'producer effect' (or within-album similarity) really is.
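
For illustration, a minimal Python sketch of building such a contrast from a track list; the record layout and the choice of albums A/B/C are placeholders, not something the review prescribes:

# Sketch of the cross-album contrast: for each artist, train on album A and
# half of album B, then score separately on the held-out tracks of album B
# (within-album condition) and on tracks from album C (cross-album condition).
from collections import defaultdict

def cross_album_split(tracks):
    """tracks: list of dicts with 'path', 'artist' and 'album' keys."""
    by_artist_album = defaultdict(lambda: defaultdict(list))
    for t in tracks:
        by_artist_album[t["artist"]][t["album"]].append(t)

    train, test_within, test_cross = [], [], []
    for artist, albums in by_artist_album.items():
        names = sorted(albums)
        if len(names) < 3:
            continue                      # need at least albums A, B and C
        a, b, c = names[:3]
        train += albums[a]                # all of album A
        b_tracks = albums[b]
        split = len(b_tracks) // 2
        train += b_tracks[:split]         # half of album B for training
        test_within += b_tracks[split:]   # rest of album B: within-album test
        test_cross += albums[c]           # album C: cross-album test
    return train, test_within, test_cross

Comparing accuracy on the within-album and cross-album test sets gives a direct measure of how much within-album (producer) similarity inflates scores.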

I'm not sure how important the M2K/D2K angle is. It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation. By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding.

I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms. It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).

Review 2

This proposal is very interesting and it is one of the most well-defined. Indeed, it seems quite straightforward to establish the ground truth and to evaluate the results.

The mentioned participants really belong to the field. People working on voice separation could be added, such as Feng, Zhuang & Pan and Tsai & Wang.

The test data are also relevant and seem easy to obtain. The RWC database could also provide some data. However, I don't think that data synthesized from MIDI can be used (to avoid "MIDI-producer" detection).

My main concern is about the range of genres spanned by the data. Indeed, if most data come from different genres, the problem becomes far easier and less relevant. I believe that artist identification and artist similarity (which is close to genre classification) are very different queries, and that artist identification is relevant only within a given genre. Thus I would like to perform the evaluation on one or two sets of artists belonging to a single genre (say classical or rock) and containing some very similar artists (say Mozart/Haydn/Gluck or The Beatles/The Rolling Stones/The Who).

Downie's Comments

Review #2 does raise the interesting point of too much spread in the "genre" aspect. I do see how it could turn into a genre task if not thought out. It would be interesting to also add in the idea of "covers": the same pieces performed by different artists. Maybe, if possible, a mix of "live" and "studio" recordings of the same pieces, if available?

Some questions:

1. Why PCM? Why mono? Why not MP3? Am being a bit of a weeny, but I am interested.

2. Do we *really* need to supply the training set? Being both provocative and pragmatic with this question.