Difference between revisions of "2005:Audio Artist"

From MIREX Wiki

Latest revision as of 16:18, 9 May 2010

Proposer

Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk


Title

Artist or group identification from musical audio.


Description

The automatic artist identification of musical audio.

1) Input data

The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.

Audio format:

  • CD-quality (Wave, 16-bit, 44100 Hz or 22050 Hz, Mono or Stereo)
  • Whole files; algorithms may use segments at the authors' discretion

Audio content:

  • 3 databases: Epitonic, Magnatune and USPOP2002
  • data set should include at least 75 different artists or groups working in any genre
  • both live performances and sequenced music are eligible
  • Each artist should be represented by a minimum of 10 examples.
  • Would be good to enforce some sort of cross-album component for the actual contest to avoid producer detection
  • A tuning database will NOT be provided. However, the RWC Magnatune database used for the 2004 Audio Description contest is still available (Training part 1 [1], Training part 2 [2])

Metadata:

  • By definition each example must have an artist or group label corresponding to one of the output classes.
  • It is assumed that artist labels will be correct
  • The genre label may also be supplied
  • The training set should be defined by a text file with one entry per line, in the following format (<> should be omitted, used here for clarity):
    <example path and filename>\t<artist label>\t<genre label>\n
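A minimal sketch of how a submission might read a training set definition in the format above. The names `TrainingEntry` and `parse_training_list` are illustrative and not part of any framework; the sketch also assumes that the genre field may be absent (since the genre label "may" be supplied) and that blank lines can be ignored.

```python
from typing import List, NamedTuple, Optional


class TrainingEntry(NamedTuple):
    path: str
    artist: str
    genre: Optional[str]  # None when no genre label was supplied


def parse_training_list(text: str) -> List[TrainingEntry]:
    """Parse tab-separated lines of the form: path, artist label, optional genre label."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(
                f"line {line_no}: expected 2 or 3 tab-separated fields, got {len(fields)}"
            )
        genre = fields[2] if len(fields) == 3 else None
        entries.append(TrainingEntry(fields[0], fields[1], genre))
    return entries


# Hypothetical file contents, for illustration only.
example = "audio/track01.wav\tArtist A\trock\naudio/track02.wav\tArtist B\n"
training_entries = parse_training_list(example)
```

Rejecting malformed lines early, rather than skipping them silently, makes labelling mistakes in the list visible before any training time is spent.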

2) Output results

  • Results should be output into a text file with one entry per line in the following format:
    <example path and filename>\t<artist classification>\n
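The output format can be produced with a few lines of code. The sketch below is only an illustration; `write_results` is a hypothetical helper, and rejecting tabs or newlines inside a field is an added safeguard rather than a stated requirement.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_results(results: Iterable[Tuple[str, str]], out: TextIO) -> int:
    """Write one tab-separated (path, artist classification) entry per line."""
    count = 0
    for path, artist in results:
        for field in (path, artist):
            # A tab or newline inside a field would corrupt the line format.
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains a tab or newline: {field!r}")
        out.write(f"{path}\t{artist}\n")
        count += 1
    return count


# Hypothetical classifications, written to an in-memory buffer for illustration.
buffer = io.StringIO()
written = write_results(
    [("audio/track01.wav", "Artist A"), ("audio/track02.wav", "Artist B")],
    buffer,
)
```

In a real submission the buffer would be replaced by a file opened at the output path passed on the command line.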

3) Maximum running time

  • The maximum running time for a single iteration of a submitted algorithm will be 24 hours (allowing a maximum of 72 hours for 3-fold cross-validation)

Participants

  • Kris West (University of East Anglia), kw@cmp.uea.ac.uk
  • Elias Pampalk (ÖFAI), elias@oefai.at
  • James Bergstra and Douglas Eck (University of Montreal), james.bergstra@umontreal.ca, eckdoug@iro.umontreal.ca
  • Michael Mandel and Dan Ellis (Columbia University), mim@ee.columbia.edu, dpwe@ee.columbia.edu
  • Thomas Lidy and Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at
  • Beth Logan (HP), beth.logan@hp.com
  • George Tzanetakis (University of Victoria), gtzan@cs.uvic.ca
  • Vitor Soares (University of Porto), vitor.soares@semanticaudiolabs.org

Other Potential Participants

  • Nicolas Scaringella (EPFL), nicolas.scaringella@epfl.ch
  • Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es
  • François Pachet (Sony CSL-Paris), pachet@csl.sony.fr
  • McKinney and Breebaart (Philips Research Labs), martin.mckinney@philips.com, jeroen.breebaart@philips.com
  • Jaume Masip-Torne (University Pompeu Fabra), jmasip@iua.upf.es

Evaluation Procedures

3 independent runs on 3 databases

Evaluation measures:

  • Unnormalised accuracy (diagonal sum of unnormalised confusion matrix) and standard deviation of results (in the event of uneven class sizes, both of these should be normalised according to class size).
  • Normalised confusion matrix.
  • Normalised diagonal sum of confusion matrix.
  • Test significance of differences in error rates of each system at each iteration using McNemar's test, mean average and standard deviation of P-values.
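The measures listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the framework's implementation: the "normalised diagonal sum" is interpreted here as the mean of per-class accuracies, and McNemar's test is computed in its exact two-sided binomial form (the proposal does not say which variant is intended). All function names are hypothetical.

```python
from math import comb
from typing import List, Sequence


def confusion_matrix(truth: Sequence[str], predicted: Sequence[str],
                     classes: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(truth, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix


def raw_accuracy(matrix: List[List[int]]) -> float:
    """Diagonal sum of the unnormalised matrix divided by the number of examples."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def normalised_accuracy(matrix: List[List[int]]) -> float:
    """Mean per-class accuracy, so that large classes do not dominate."""
    per_class = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(per_class) / len(per_class)


def mcnemar_exact_p(truth: Sequence[str], pred_a: Sequence[str],
                    pred_b: Sequence[str]) -> float:
    """Exact two-sided McNemar test on the examples where only one system is correct."""
    only_a = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, with true labels x, x, x, y and predictions x, x, y, y, the unnormalised accuracy is 0.75, while the mean per-class accuracy is (2/3 + 1)/2, which shows how the two measures diverge when class sizes are uneven.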

Evaluation framework:

The competition framework is to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k). It will allow submission of contributions in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due March 2005) as well as in Matlab, Python and C++, using the external code integration services provided in M2K. Submissions will be required to read training set definitions from a text file in the format specified under 1) Input data and to output results in the format described under 2) Output results above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results, and perform McNemar's testing of differences between the error rates of each system. An example framework could be made available in March for submission development.
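How the framework might define the training and test sets for each cross-validation iteration can be sketched as below. This is not the D2K implementation; it assumes a 3-fold split stratified by artist, so that every fold contains a share of each artist's examples, and a fixed random seed so the split is reproducible. The function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

Entry = Tuple[str, str]  # (example path, artist label)


def stratified_folds(entries: Sequence[Entry], n_folds: int = 3,
                     seed: int = 0) -> List[Tuple[List[Entry], List[Entry]]]:
    """Return one (train, test) pair per fold, dealing each artist's files across folds."""
    by_artist = defaultdict(list)
    for path, artist in entries:
        by_artist[artist].append(path)

    rng = random.Random(seed)
    folds: List[List[Entry]] = [[] for _ in range(n_folds)]
    for artist in sorted(by_artist):
        paths = sorted(by_artist[artist])
        rng.shuffle(paths)
        # Deal this artist's examples round-robin so each fold gets a share.
        for i, path in enumerate(paths):
            folds[i % n_folds].append((path, artist))

    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [e for j, fold in enumerate(folds) if j != k for e in fold]
        splits.append((train, test))
    return splits
```

With a minimum of 10 examples per artist, each test fold would hold three or four examples of every artist under this scheme.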

Format for algorithm calls

There are four formats for calls to code external to D2K that will be supported:

  • CommandName inputFileNameAndPath outputFileNameAndPath
  • CommandName inputFileNameAndPath (output file name created by adding an extension, e.g. ".features")

The second two formats allow an additional file to be passed as a parameter:

  • CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath
  • CommandName inputFileNameAndPath1 inputFileNameAndPath2 (output file name created by adding an extension to inputFileNameAndPath1, e.g. ".features")

E.g.
ExtractFeatures C:\inTrainFiles.txt C:\outTrainFeatures.feat
ExtractFeatures C:\inTestFiles.txt C:\outTestFeatures.feat
TrainModel C:\outTrainFeatures.feat
ApplyModel C:\outTrainFeatures.feat.model C:\outTestFeatures.feat C:\results.txt
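A submitted program could support both the explicit and the derived output name with a small helper like the one sketched below. `resolve_paths` is a hypothetical name; the sketch assumes each command knows how many input files it takes, since the argument count alone cannot distinguish "one input plus an output" from "two inputs".

```python
from typing import List, Sequence, Tuple


def resolve_paths(args: Sequence[str], n_inputs: int,
                  extension: str) -> Tuple[List[str], str]:
    """Split command-line arguments into input paths and an output path.

    If no output path is given, it is derived by appending `extension`
    to the first input path.
    """
    if len(args) == n_inputs:
        return list(args), args[0] + extension
    if len(args) == n_inputs + 1:
        return list(args[:n_inputs]), args[n_inputs]
    raise ValueError(
        f"expected {n_inputs} or {n_inputs + 1} arguments, got {len(args)}"
    )


# Mirrors the TrainModel call above: one input, output name derived.
train_inputs, model_path = resolve_paths([r"C:\outTrainFeatures.feat"], 1, ".model")

# Mirrors the ApplyModel call above: two inputs and an explicit output.
apply_inputs, results_path = resolve_paths(
    [r"C:\outTrainFeatures.feat.model", r"C:\outTestFeatures.feat", r"C:\results.txt"],
    2,
    ".results",
)
```

In a real program the `args` list would come from `sys.argv[1:]`.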

Relevant Test Collections

(Note the potentially significant data overlap between this task and the genre classification competition.)

  • Re-use the Magnatune database (???)
  • Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with Creative Commons licences)
  • Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
  • Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites

Ground truth annotations:

Will assume artist labelling is correct, although homogeneous labels for a single artist will have to be confirmed (i.e. will be checked for spelling errors etc.)


Review 1

The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples. Both tests follow on from successful evaluations held at ISMIR2004; there was good interest and interesting results, and I think we can expect good participation in 2005.

The tasks are well-defined, easily understood, and appear to have some practical importance. The evaluation and testing procedures are very good. This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.

My only comments relate to the choice of ground truth data. For the artist ID task I think we should use real, commercial recordings, since there is no shortage of them, and the artist ground truth is easily defined. I do not think it is important to have independent verification of the ground truth, since there will be enough examples to ensure that a few questionable cases won't much hurt the overall performance, and in any case all we really care about is comparative performance. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution. If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders. (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)

My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.

In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results. In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete. Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat.

Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions. However, having each site run their own classifiers is a huge win in terms of the logistics of running the test. I would, however, discourage any scheme which involved releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set, if the test set labels are just lying around.

I am particularly interested in a set of contrast conditions for different scales of problem - 1 in 10, 1 in 50, 1 in 100 etc. Most artist ID tasks have been on very small subsets of 'all possible artists', and it would be interesting to see if there are differences in how different approaches scale (e.g. that only some techniques are tractable for very large sets).

I also think that a cross-album condition is particularly interesting. Again, this could be a contrast: for each artist, have training data from albums A and B, then have (disjoint) test data from albums B and C, and compare the accuracy on both cases to see how strong the 'producer effect' (or within-album similarity) really is.
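One simple way to build the album-disjoint half of such a contrast is sketched below. It is an assumption-laden illustration rather than a proposed procedure: for each artist, one album (here simply the last in sorted order) is held out entirely for testing, and artists with fewer than two albums are reported separately because no cross-album test is possible for them. The name `cross_album_split` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Track = Tuple[str, str, str]  # (artist, album, path)


def cross_album_split(tracks: Iterable[Track]) -> Tuple[List[Track], List[Track], List[str]]:
    """Return (train, test, skipped_artists) with no album shared between train and test."""
    by_artist = defaultdict(lambda: defaultdict(list))
    for artist, album, path in tracks:
        by_artist[artist][album].append(path)

    train: List[Track] = []
    test: List[Track] = []
    skipped: List[str] = []
    for artist in sorted(by_artist):
        albums = by_artist[artist]
        if len(albums) < 2:
            skipped.append(artist)  # cannot hold out an album and still train
            continue
        names = sorted(albums)
        held_out = names[-1]  # arbitrary but deterministic choice
        for name in names:
            target = test if name == held_out else train
            target.extend((artist, name, p) for p in albums[name])
    return train, test, skipped
```

Comparing accuracy on this split with accuracy on a split that mixes tracks from the same albums across training and testing would give an estimate of the within-album effect.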

I'm not sure how important the M2K/D2K angle is. It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation. By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding.

I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms. It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).

Review 2

This proposal is very interesting and it is one of the most well defined. Indeed it seems quite straightforward to establish the ground truth and to evaluate the results.

The mentioned participants really belong to the field. People working on voice separation could be added, such as Feng, Zhuang & Pan and Tsai & Wang.

The test data are also relevant and seem easy to obtain. The RWC database could also provide some data. However I don't think that data synthesized from MIDI can be used (to avoid the "MIDI-producer" detection).

My main concern is about the range of genres spanned by the data. Indeed, if most data come from different genres, the problem becomes far easier and less relevant. I believe that artist identification and artist similarity (which is close to genre classification) are very different queries, and that artist identification is relevant only within a given genre. Thus I would like to perform the evaluation on one or two sets of artists belonging to a single genre (say classical or rock) and containing some very similar artists (say Mozart/Haydn/Gluck or The Beatles/The Rolling Stones/The Who).

Downie's Comments

Review #2 does raise the interesting point of too much spread in the "genre" aspect. I do see how it could turn into a genre task if not thought out. Would be interesting to also add in the idea of "covers": same pieces but performed by different artists. Maybe, if possible, a mix of "live" and "studio" recordings of same pieces if available?

Some questions:

1. Why PCM? Why mono? Why not MP3? Am being a bit of a weeny, but I am interested.

2. Do we *really* need to supply the training set? Being both provocative and pragmatic with this question.

Kris' thoughts

Contents:

  1. Multiple genres and Artist ID
  2. Framework issues and algorithm submission
  3. Producing ground truth and answers to Downie's comments
  4. Who has data?

1. Multiple genres and Artist ID

Dan Ellis wrote:

> About multiple genre classification: I have pretty serious doubts
> about genre classification in the first place, because of the
> seemingly arbitrary nature of the classes and how they are assigned.

IMHO the genre classification task is to reproduce an arbitrary set of culturally assigned classes. Tim Pohle (at the ISMIR 2004 grad school) gave an interesting talk on using genre classifiers to reproduce arbitrary, user assigned classes, to manage a user's personal music collection. We also discussed how to suggest new music choices from a larger catalog by thresholding the probabilities of membership of new music to favored classes.

Dan Ellis wrote:

> This is why I prefer artist identification as a task. That said,
> assigning multiple genres seems not much worse, but not much better
> either. Allowing for fuzzy, multiple characteristics seems to address
> some of the problems with genres -- which is good -- but now defining
> the ground truth is even more arguable and problematic, since we now
> have that much more ground truth data to define -- degree of
> membership, and over a larger set of classes.

I also prefer the artist ID task, but for different reasons; I think we use too few classes to properly evaluate the Genre classification task, as some models/techniques fall over if given too many classes to evaluate. Obviously this has come about because of storage, IP and ground-truth constraints. However, if a hierarchy is used (as suggested for the symbolic track), rather than a bag of labels, the ground-truth problem is no bigger, as higher level labels can be interpolated, and it will be easier both to expand the database to include more pieces and to implement a finer granularity of labels (more sub-genres) in later evaluations. Small taxonomies are the biggest hurdle in the accurate evaluation of Genre classification systems. We can probably define around 10 lowest level classes for this year, but should aim to add the same number again next year and the year after, until we can be confident that we have a database that poses a classification problem that is as difficult as a real world application (such as organizing/classifying the whole Epitonic catalog).

Dan Ellis wrote:

> One of the reasons I am interested in a parallel evaluation of genre
> classification and artist ID is that it may provide some objective
> evidence for my gut bias against genres: if the results of different
> algorithms under genre classification are more equivocal than artist
> ID (i.e. they reveal less about the difference between the
> algorithms), then that's some kind of evidence that the task itself is
> ill-posed. My suspicion is that multiple, real-valued genre
> memberships will be even less revealing.

I also believe that many classification techniques and feature extractions are vulnerable to smaller numbers of examples per class, and I think this is far more likely to show up in the comparison of the performance of an algorithm between the two tracks (my own submission will be modified for the artist ID track). Artist identification is about modeling a natural grouping within the data, whereas genres are not necessarily natural groupings, and I believe the accurate modeling of hierarchical, multi-modal genres is likely to be more complex than modeling an artist's work (although this is alleviated by the additional data available). An artist may work in a number of styles but there is *usually* some dimension along which all the examples are grouped.

Dan Ellis wrote:

> The most important thing, I think, is to define the evaluation to support
> (and encourage) the largest number of participants, meaning that we could
> include this as an option, but also evaluate a 1-best genre ID to remain
> accessible to algorithms that intrinsically can only report one class.

With this in mind I think we should opt for a hierarchical taxonomy, which can support direct comparison of hierarchical classifiers, single label classifiers (by interpolating higher level classifications in evaluation framework) and multiple label classifiers (in a somewhat limited fashion, perhaps with a penalty for additional incorrect labels, which is probably not fair, or by limiting number of labels to match height of taxonomy). I suggest that each correct label scores one point, e.g. rock/pop, rock, indie rock would score 3 if all labels are correct.
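The one-point-per-correct-label idea, with higher-level labels interpolated for single label classifiers, can be sketched as follows. The taxonomy in `PARENT` is a made-up fragment for illustration only, and the sketch assumes that scoring stops at the first level where the predicted path leaves the true path.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy fragment: each label maps to its parent (None for a top-level label).
PARENT: Dict[str, Optional[str]] = {
    "rock/pop": None,
    "rock": "rock/pop",
    "pop": "rock/pop",
    "indie rock": "rock",
    "electronic": None,
    "techno": "electronic",
}


def path_from_root(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Interpolate the higher-level labels for a single label, root first."""
    path = []
    current: Optional[str] = label
    while current is not None:
        path.append(current)
        current = parent[current]
    return path[::-1]


def hierarchical_score(true_label: str, predicted_label: str,
                       parent: Dict[str, Optional[str]]) -> int:
    """One point for each level at which the predicted path matches the true path."""
    score = 0
    for t, p in zip(path_from_root(true_label, parent),
                    path_from_root(predicted_label, parent)):
        if t != p:
            break
        score += 1
    return score
```

Under this sketch a correct "indie rock" prediction scores 3 (rock/pop, rock, indie rock), predicting "pop" for the same piece scores 1, and predicting "techno" scores 0.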


2. Framework issues and algorithm submission

I don't think it is particularly ambitious to have people submit their code for evaluation at a single location. I have already implemented a basic D2K framework that can run anything that will run from the command line, including Matlab. The only constraint is that a submission will have to conform to a simple text file input and output format. Marsyas-0.1 and Matlab examples have been produced and I am happy for people to take the IO portions directly from this code if they wish. Having code submitted to a central evaluation site will allow us to perform cross-validation experiments and assess exactly how much variance there is in each system's performance. This would not be so essential if we had a very large data set (min 10,000 examples); however, we are going to get nowhere near that many (maybe in later years...). It was also suggested in the reviews that this would hamstring feature selection techniques (see review 2), but I don't believe this; surely the feature selection code (including any classifier used) would be correctly implemented in the feature extraction phase.

I could also define an optional simple text file format for descriptors. This would allow the hybridization of any submitted systems using this format and the use of a bench mark classifier to evaluate the power of the descriptors and classifiers independent of each other. I would be happy to provide several bench mark classifiers for this purpose (possibly by creating an interface to Weka). I would also be interested in seeing the performance of a mixture of experts system, built from all the submitted systems, which should, in theory, be able to equal or better the performance of all of the submitted systems.

M2K is coming up to its Alpha release and will include a cut down version of the competition framework so that people can see how it works (External integration itinerary). As D2K can run across X Windows we could even provide a virtual lab evaluation setup, so that each participant could run their own submission (without violating any IP laws) if they really wanted, and ensure that it ran ok. Anyone can get a license for D2K and the framework will probably be included in a later version of M2K, so anyone can make sure that their submission works ahead of time.


3. Producing ground truth and answers to Downie's comments

First, I don't think we need to send out a tuning database; it creates problems and solves none. If data is held at the evaluation site we don't have any IP issues and, as I stated earlier, anyone could launch their submission themselves in D2K across an X Windows session (note all console output from code external to D2K is collected and forwarded to the D2K console to aid debugging). If we use IP-free databases, we are unlikely to be able to validate ground-truth with online services such as [3], and it has also been suggested that IP-free databases are not necessarily representative of the whole music community. Several people have said that it doesn't matter if some of the labels are incorrect; however, I'm not afraid to volunteer to validate the labels of a subset of the data (say 200 files, humans can get through them quicker than you'd think), and if there were sufficient volunteers it would go a long way to establishing an IP-free research database with good ground-truth (if I don't get any volunteers I won't consider this an option, so email me!).

Personally I think we should use a large volume of copyrighted material, with labels confirmed by at least two sources (existing dbase label, CDDB and allmediaguide, or a human labeler). The format should be WAV (MP3s would have to be decoded to this anyway) and will be mono unless anyone specifically requests stereo (both can be made available or can be handled by framework).

Should we rename this the Style classification task?


4. Who has data?

Anyone with music (with or without ground-truth) that we can use should make themselves known ASAP. I can provide a fair selection of white label (IP-free) dance music in at least 3 subgenres, with labels defined by 3 expert listeners.