Difference between revisions of "2005:Audio Genre"

From MIREX Wiki

Revision as of 09:18, 2 February 2005

Proposer

Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk


Title

Genre Classification from polyphonic audio.


Description

The automatic classification of polyphonic musical audio (in PCM format) into a single high-level genre per example. If there is sufficient demand, a multiple genre track could be defined, requiring submissions to identify each genre (without prior knowledge of the number of labels), with the precision and recall scores calculated for each result.
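As an illustration of how the precision and recall scores for the optional multiple genre track might be computed per example, here is a minimal sketch. The function name `precision_recall` and the treatment of empty label sets (scored as 0.0) are assumptions made for this example and are not part of the proposal.

```python
def precision_recall(true_labels, predicted_labels):
    """Per-example precision and recall for a multi-genre result.

    Both arguments are iterables of genre labels; duplicates are ignored.
    Scoring empty sets as 0.0 is an assumption of this sketch.
    """
    true_set, pred_set = set(true_labels), set(predicted_labels)
    hits = len(true_set & pred_set)  # labels that are both predicted and correct
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall


# Example: two true genres, three predicted, one in common.
p, r = precision_recall({"Jazz", "Electronic"}, {"Jazz", "Rock", "Pop"})
```

Because the number of labels is not known in advance, a submission that outputs many genres would gain recall at the cost of precision, which is why both scores are reported.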

1) Input data

The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.

Audio format:

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • Either whole files or 1 minute excerpts

Audio content:

  • polyphonic music
  • the data set should include at least 8 different genres. Suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Ballroom Dance, Electronic/Modern Dance, Classical and Folk. "World" music should be excluded, as it is a common "catch-all" for ethnic/folk music that is not easily classified into another group, and can contain music as diverse as Indian tabla and Celtic rock.
  • the classification could also be evaluated in two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Pop/Rock), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).
  • both live performances and sequenced music are eligible
  • Each class should be represented by a minimum of 100 examples, but 150 would be preferred. If possible the same number of examples should represent each class.
  • If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.

Metadata:

  • By definition each example must have a genre label corresponding to one of the output classes.
  • Where possible, existing genre labels should be confirmed by two or more non-entrants. Due to IP constraints, it is unlikely that we will be allowed to distribute any database for metadata validation by participants.
  • The training set should be defined by a text file with one entry per line, in the following format:

<example path and filename>\t<genre label>\n

2) Output results

Results should be output into a text file with one entry per line in the following format:

<example path and filename>\t<genre classification>\n
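The two tab-separated formats above could be read and written along the following lines. This is only a sketch: the function names `read_label_file` and `write_results` are hypothetical, and skipping blank lines is an assumption rather than something the proposal specifies.

```python
import io


def read_label_file(stream):
    """Parse '<path>\\t<genre label>' lines into a list of (path, label) pairs."""
    entries = []
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are tolerated and skipped
        path, sep, label = line.partition("\t")
        if not sep or not label:
            raise ValueError(f"line {line_no}: expected '<path>\\t<label>'")
        entries.append((path, label))
    return entries


def write_results(stream, predictions):
    """Write (path, genre) pairs, one tab-separated entry per line."""
    for path, genre in predictions:
        stream.write(f"{path}\t{genre}\n")


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_results(buffer, [("audio/track 01.wav", "Jazz/Blues"), ("audio/02.wav", "Rock")])
round_trip = read_label_file(io.StringIO(buffer.getvalue()))
```

Splitting on the first tab only means that paths containing spaces are preserved as written.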


Potential Participants

  • Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, High
  • Elias Pampalk (ÖFAI), elias@oefai.at, High
  • George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, High
  • Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High
  • Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, High
  • Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
  • François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium


Evaluation Procedures

3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.
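One way the folds could be built so that each contains an equal proportion of each class is sketched below. The names `stratified_folds` and `train_test_splits`, the fixed random seed and the round-robin dealing are all assumptions of this example. The actual framework is to be defined in D2K.

```python
import random
from collections import defaultdict


def stratified_folds(entries, k=3, seed=0):
    """Split (path, label) pairs into k folds with near-equal class proportions."""
    by_label = defaultdict(list)
    for path, label in entries:
        by_label[label].append((path, label))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Deal each class round-robin so every fold gets its share.
        for i, item in enumerate(items):
            folds[i % k].append(item)
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs, holding out one fold per iteration."""
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test
```

If the class sizes are not multiples of k, the earlier folds receive the extra examples, so proportions are only approximately equal in that case.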

Evaluation measures:

  • Simple accuracy and standard deviation of results (in the event of uneven class sizes, both should be normalised according to class size).
  • Test the significance of differences in the error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of the P-values.
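The measures above can be sketched as follows. The proposal does not say which form of McNemar's test is intended. This example assumes the exact (binomial) two-sided form, and the function names are hypothetical. `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb


def accuracy(truth, predicted):
    """Fraction of examples whose predicted genre matches the ground truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def class_normalised_accuracy(truth, predicted):
    """Mean of per-class accuracies, so that large classes do not dominate."""
    totals = Counter(truth)
    hits = Counter(t for t, p in zip(truth, predicted) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def mcnemar_exact(truth, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two systems on the same examples."""
    # b: A right and B wrong; c: A wrong and B right (the discordant pairs).
    b = sum(a == t and p != t for t, a, p in zip(truth, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(truth, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Running `mcnemar_exact` once per cross-validation iteration gives one P-value per fold, from which the mean and standard deviation can then be taken, for instance with the standard `statistics` module.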

Evaluation framework:

Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++, using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results, and perform McNemar's testing of differences between the error rates of each system. An example framework could be made available in early February for submission development.


Relevant Test Collections

  • Re-use Magnatune database (???)
  • Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with Creative Commons)
  • Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
  • Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites

Ground truth annotations:

All annotations should be validated by at least two non-participating volunteers (if possible), rather than accepting supplied genre labels. If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification being decided by a majority vote. Any particularly contentious classifications could be removed.
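A minimal sketch of the majority-vote step follows. Here a "contentious" example is taken to be one where no label wins a strict majority of the votes. That threshold, and the name `majority_label`, are assumptions of this example, as the proposal does not define them.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the winning genre, or None when agreement is too low.

    An example is treated as contentious (None) unless the most common
    label receives strictly more than min_agreement of the votes.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Hypothetical annotations: the second track has no majority and is dropped.
annotations = {
    "a.wav": ["Rock", "Rock", "Pop"],
    "b.wav": ["Jazz/Blues", "Electronic/Modern Dance"],
}
ground_truth = {
    path: label
    for path, votes in annotations.items()
    if (label := majority_label(votes)) is not None
}
```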


Review 1

The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples. Both tests follow on from successful evaluations held at ISMIR2004; there was good interest and interesting results, and I think we can expect good participation in 2005.

The tasks are well-defined, easily understood, and appear to have some practical importance. The evaluation and testing procedures are very good. This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.

My only comments relate to the choice of ground truth data. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution. If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders. (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)

My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.

In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results. In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete. Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed, i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat.

Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions. However, having each site run their own classifiers is a huge win in terms of the logistics of running the test. I would, however, discourage any scheme which involved releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set, if the test set labels are just lying around.

I'm not sure how important the M2K/D2K angle is. It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation. By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding.

In terms of the genre contest, the big issue is the unreliability and unclear definitions of the ground truth labels. It seems weird to have one evaluation on the ability to distinguish an arbitrary set of artists - a very general-sounding problem - and another contest which is specifically dominated by the ability to distinguish classical from jazz from rock - a very specific, and perhaps not very important, problem.

Again in this case I don't particularly like the idea of trying to get multiple labellings: for artists, I thought it was unnecessary because agreement will be very high. Here, I think it's of dubious value because agreement will be so low; in both cases, errors in ground truth impact all participants equally, and so are not really a concern - we are mostly interested in relative values, so a ceiling on absolute performance due to a few 'incorrect' reference labels is of little consequence.

Clearly, we can run a genre contest: I would again advocate for real music, and not worry too much about copyright issues, and not even worry too much about where the genre ground truth comes from, since it is always pretty suspect; allmusic.com is as good a source as any. But I personally find this contest of less intellectual interest than artist ID, even though it has historically received more attention, because of the poor definition of the true, underlying classes.

I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms. It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).

Review 2

The single genre problem is well defined and seems to be a relevant problem for the MIR community nowadays. Obviously, it would be more relevant to classify each track into multiple genres or to use a hierarchy of genres, but the proposal does not deal with these issues in a satisfying way. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are nodes for Electronic and Jazz/Blues, where does the leaf Electro-jazz lie? I suggest that the contest concentrate on the well-defined simple genre problem. An interesting development would be to ask algorithms to assign a probability to each predefined genre for each track, instead of outputting a single genre with 100% probability. Regarding the input format, I think that whole files are better (the total duration and the volume variation are already good genre descriptors) and that polyphony is not required (classical music contains many works for solo instruments).

I have no precise opinion regarding the defined genres, since this is more a matter of cultural importance. I'm not sure that Rock is less diverse than World (what do Elvis and Radiohead have in common?). Also, I am surprised that there is no Rap/RnB. The choice of the genre classes is a crucial issue if the contest is to be held several times: existing databases can be reused only when the defined categories are identical each year. Thus I would like this choice to be discussed further by the participants.

The list of participants is relevant. McKinney and Breebaart could be added.

It is a good idea to accept many programming languages for submission. However, it seems quite difficult to implement the learning phase, because each algorithm may use different structures to store learnt data. For instance, when the algorithm computes descriptors and feeds them through a classifier, is it possible to select the best descriptors? If not, it is not realistic to suppose that the participant has to do it beforehand on his own limited set of data. I then see two possibilities: either participants are given 50% of the database and do all the learning work themselves (in which case no k-fold cross-validation is performed), or submissions concern only sets of descriptors and not full classification algorithms. The second choice has the advantage of allowing different sets of descriptors to be compared using the same classifiers.

The test data are relevant but still a bit vague. Obviously, existing databases should be used again and completed with new annotated data. The participants should list their own databases in detail and pool them for evaluation, so that the time needed to annotate new data can be estimated.

Downie's Comment

1. Think genre tasks are kinda fun, actually. Devil is in the details. Would give my eye teeth to avoid manually labelling genre classes. You set up eight classes with 100-150 examples. That comes to 800-1200 labels that need applying. Can we as a group come up with a possible standardized source for genre labels and then, even though they are not perfect, live with our choice? Perhaps in these early days, we would be best served by looking at only the broadest of categories and not fussing about the fine-grained subdivisions?

2. Would be interesting to have a TRUE genre task! As we learned in the UPF doctoral seminar prior to ISMIR 2004, genre is properly defined as the "use" of the music: dance, liturgical, funereal, etc. What we are calling genre here is really style. Just a thought.