==Relevant Test Collections==

* Re-use the Magnatune database
* Individual contributions of copyright-free recordings (including white-label vinyl and music databases released under Creative Commons licenses)
* Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
* Solicit contributions from http://creativecommons.org/audio/, http://www.epitonic.com, http://www.mp3.com/ (which offers several free audio streams) and similar sites
* Validate metadata through free services such as http://www.MP3.com, http://www.allmusic.com and CDDB

Ground truth annotations:

All annotations should be validated by at least two sources, including non-participating volunteers (if possible), rather than accepting the supplied genre labels.

==Kris' thoughts==

Contents:
# Multiple genres and Artist ID
# Framework issues and algorithm submission
# Producing ground truth and answers to Downie's comments
# Who has data?
----

1. Multiple genres and Artist ID

Dan Ellis wrote:

> About multiple genre classification: I have pretty serious doubts<br>
> about genre classification in the first place, because of the<br>
> seemingly arbitrary nature of the classes and how they are assigned.<br>
 
 
 
IMHO the genre classification task is to reproduce an arbitrary set of culturally assigned classes. Tim Pohle (at the ISMIR 2004 graduate school) gave an interesting talk on using genre classifiers to reproduce arbitrary, user-assigned classes in order to manage a user's personal music collection. We also discussed how to suggest new music from a larger catalog by thresholding the probability of membership of new pieces in the user's favored classes.
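A minimal sketch of that thresholding idea in Python (the class names, probabilities and threshold value are all invented for illustration):

<pre>
# Sketch: suggest new pieces whose predicted membership of a user's
# favored classes exceeds a threshold. Class names, probabilities and
# the threshold are all invented for illustration.

FAVORED_CLASSES = {"indie rock", "ambient"}  # classes the user likes
THRESHOLD = 0.6                              # minimum membership probability

def suggest(catalog):
    """catalog: iterable of (track_id, {class_name: probability}) pairs."""
    return [track_id
            for track_id, memberships in catalog
            if any(memberships.get(c, 0.0) >= THRESHOLD
                   for c in FAVORED_CLASSES)]

catalog = [
    ("track_01", {"indie rock": 0.82, "ambient": 0.05}),
    ("track_02", {"classical": 0.91, "ambient": 0.40}),
]
print(suggest(catalog))  # ['track_01']
</pre>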
 
 
 
Dan Ellis wrote:
 
 
 
> This is why I prefer artist identification as a task.  That said,<br>
 
> assigning multiple genres seems not much worse, but not much better<br>
 
> either.  Allowing for fuzzy, multiple characteristics seems to address<br>
 
> some of the problems with genres -- which is good -- but now defining<br>
 
> the ground truth is even more arguable and problematic, since we now<br>
 
> have that much more ground truth data to define -- degree of<br>
 
> membership, and over a larger set of classes.<br>
 
 
 
I also prefer the artist ID task, but for different reasons: I think we use too few classes to properly evaluate the genre classification task, as some models/techniques fall over when given many classes to discriminate, and a small class set cannot expose that weakness. Obviously this has come about because of storage, IP and ground-truth constraints. However, if a hierarchy is used (as suggested for the symbolic track), rather than a flat bag of labels, the ground-truth problem is no bigger, because the higher-level labels can be interpolated from the leaf label, and it becomes easier both to expand the database with more pieces and to introduce a finer granularity of labels (more sub-genres) in later evaluations; a minimal sketch of this interpolation follows below. Small taxonomies are the biggest hurdle to accurate evaluation of genre classification systems. We can probably define around 10 lowest-level classes this year, but we should aim to add the same number again next year, and the year after, until we are confident that the database poses a classification problem as difficult as a real-world application (such as organizing/classifying the whole Epitonic catalog).
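For illustration, interpolating the higher-level labels implied by a single leaf genre (the tiny taxonomy here is invented):

<pre>
# Sketch: derive the full root-to-leaf label path for a leaf genre,
# so a single leaf prediction also yields the higher-level labels.
# Invented taxonomy: child -> parent, with roots mapping to None.

PARENT = {
    "rock/pop": None,
    "rock": "rock/pop",
    "indie rock": "rock",
    "electronic": None,
    "techno": "electronic",
}

def label_path(leaf):
    """Return the labels from root to leaf."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return list(reversed(path))

print(label_path("indie rock"))  # ['rock/pop', 'rock', 'indie rock']
</pre>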
 
 
 
Dan Ellis wrote:
 
 
 
> One of the reasons I am interested in a parallel evaluation of genre<br>
 
> classification and artist ID is that it may provide some objective<br>
 
> evidence for my gut bias against genres: if the results of different<br>
 
> algorithms under genre classification are more equivocal than artist<br>
 
> ID (i.e. they reveal less about the difference between the<br>
 
> algorithms), then that's some kind of evidence that the task itself is<br>
 
> ill-posed.  My suspicion is that multiple, real-valued genre<br>
 
> memberships will be even less revealing.
 
 
 
I also believe that many classification techniques and feature extractors are vulnerable to small numbers of examples per class, and I think this is far more likely to show up in a comparison of an algorithm's performance between the two tracks (my own submission will be modified for the artist ID track). Artist identification is about modeling a natural grouping within the data, whereas genres are not necessarily natural groupings, and I believe accurately modeling hierarchical, multi-modal genres is likely to be more complex than modeling an artist's work (although this is alleviated by the additional data available). An artist may work in a number of styles, but there is *usually* some dimension along which all the examples are grouped.
 
 
 
Dan Ellis wrote:
 
 
 
> The most important thing, I think, is to define the evaluation to support<br>

> (and encourage) the largest number of participants, meaning that we could<br>

> include this as an option, but also evaluate a 1-best genre ID to remain<br>

> accessible to algorithms that intrinsically can only report one class.
 
 
 
With this in mind, I think we should opt for a hierarchical taxonomy, which can support direct comparison of hierarchical classifiers, single-label classifiers (by interpolating the higher-level classifications in the evaluation framework) and multiple-label classifiers (in a somewhat limited fashion, perhaps with a penalty for additional incorrect labels, which is probably not fair, or by limiting the number of labels to match the depth of the taxonomy). I suggest that each correct label scores one point; e.g. rock/pop, rock, indie rock would score 3 if all labels are correct.
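A minimal sketch of that scoring rule, assuming ground truth and prediction are simply lists of labels ordered from root to leaf:

<pre>
# Sketch: one point per correct label in the hierarchical path.
# Truth and prediction are lists of labels ordered root -> leaf.

def score(truth, predicted):
    """Count predicted labels that match the truth at the same level."""
    return sum(1 for t, p in zip(truth, predicted) if t == p)

truth = ["rock/pop", "rock", "indie rock"]
print(score(truth, ["rock/pop", "rock", "indie rock"]))  # 3
print(score(truth, ["rock/pop", "rock", "hard rock"]))   # 2
</pre>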
 
 
 
----
 
2. Framework issues and algorithm submission
 
 
 
I don't think it is particularly ambitious to have people submit their code for evaluation at a single location. I have already implemented a basic D2K framework that can run anything that runs from the command line, including Matlab. The only constraint is that a submission will have to conform to a simple text-file input and output format. Marsyas-0.1 and Matlab examples have been produced, and I am happy for people to take the IO portions directly from this code if they wish. Having code submitted to a central evaluation site will allow us to perform cross-validation experiments and assess exactly how much variance there is in each system's performance. This would not be so essential if we had a very large data set (a minimum of 10,000 examples), but we are going to get nowhere near that many (maybe in later years...). It was also suggested in the reviews that this would hamstring feature selection techniques (see review 2), but I don't believe this; surely the feature selection code (including any classifier used) would simply be implemented in the feature extraction phase.
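For illustration, a minimal sketch of what a conforming submission could look like. The actual input/output format is not pinned down in this discussion, so the one-path-per-line input list and tab-separated output here are assumptions:

<pre>
# Sketch of a submission obeying a simple text-file contract. Assumed:
# the input list file has one audio path per line, and the output file
# gets one "path\tlabel" line per input.

import sys

def classify(wav_path):
    # Placeholder for the real feature-extraction/classification code.
    return "rock"

def main(list_file, out_file):
    with open(list_file) as fin, open(out_file, "w") as fout:
        for line in fin:
            path = line.strip()
            if path:
                fout.write(path + "\t" + classify(path) + "\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
</pre>

Anything with a command-line entry point of this shape (e.g. python submission.py filelist.txt output.txt) could be launched by the D2K wrapper.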
 
 
 
I could also define an optional simple text-file format for descriptors. This would allow the hybridization of any submitted systems using this format, and the use of a benchmark classifier to evaluate the power of the descriptors and classifiers independently of each other. I would be happy to provide several benchmark classifiers for this purpose (possibly by creating an interface to Weka). I would also be interested in seeing the performance of a mixture-of-experts system built from all the submitted systems, which should, in theory, be able to equal or better the performance of any individual submission.
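As a sketch of the simplest possible combination, a majority vote over the single labels output by each system; a real mixture of experts would weight the systems, e.g. by validation accuracy:

<pre>
# Sketch: baseline "mixture of experts" -- a majority vote over the
# single labels output by each submitted system for each track.

from collections import Counter

def majority_vote(labels):
    """labels: one predicted label per system, for one track."""
    return Counter(labels).most_common(1)[0][0]

# One row per track; three hypothetical systems.
per_track = [
    ["rock", "rock", "techno"],
    ["techno", "techno", "rock"],
]
print([majority_vote(row) for row in per_track])  # ['rock', 'techno']
</pre>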
 
 
 
M2K is coming up to its alpha release and will include a cut-down version of the competition framework so that people can see how it works (external integration itinerary). As D2K can run across X Windows, we could even provide a virtual-lab evaluation setup, so that each participant could run their own submission (without violating any IP laws) if they really wanted to, and ensure that it ran OK. Anyone can get a license for D2K, and the framework will probably be included in a later version of M2K, so anyone can make sure that their submission works ahead of time.
 
 
 
----
 
3. Producing ground truth and answers to Downie's comments
 
 
 
First, I don't think we need to send out a tuning database; it creates problems and solves none. If the data is held at the evaluation site, we don't have any IP issues, and as I stated earlier, anyone could launch their submission themselves in D2K across an X Windows session (note that all console output from code external to D2K is collected and forwarded to the D2K console to aid debugging). If we use IP-free databases, we are unlikely to be able to validate the ground truth with online services such as http://www.allmediaguide.com/, and it has also been suggested that IP-free databases are not necessarily representative of the whole music community. Several people have said that it doesn't matter if some of the labels are incorrect; however, I'm not afraid to volunteer to validate the labels of a subset of the data (say 200 files; humans can get through them quicker than you'd think), and if there were sufficient volunteers it would go a long way towards establishing an IP-free research database with good ground truth (if I don't get any volunteers I won't consider this an option, so email me!).
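Validation boils down to an agreement check between sources; a minimal sketch of the at-least-two-sources rule proposed above (source names and labels invented for illustration):

<pre>
# Sketch: accept a label only when at least two independent sources
# agree; otherwise flag the track for a human validator.

from collections import Counter

def validated_label(sources):
    """sources: dict of source name -> genre label it supplies.
    Returns the agreed label, or None if no two sources agree."""
    label, count = Counter(sources.values()).most_common(1)[0]
    return label if count >= 2 else None

track = {"supplied": "techno", "cddb": "techno", "allmusic": "electronic"}
print(validated_label(track))  # 'techno' -- two sources agree
</pre>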
 
 
 
Personally, I think we should use a large volume of copyrighted material, with labels confirmed by at least two sources (the existing database label, CDDB and allmediaguide, or a human labeler). The format should be WAV (MP3s would have to be decoded to this anyway) and will be mono unless anyone specifically requests stereo (both can be made available, or this can be handled by the framework).
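For illustration, a sketch of the stereo-to-mono downmix the framework would need, using only the Python standard library (file names are placeholders):

<pre>
# Sketch: downmix a 16-bit stereo WAV to mono by averaging channels.
# Assumes a little-endian machine (WAV sample data is little-endian).

import array
import wave

with wave.open("input_stereo.wav", "rb") as fin:
    assert fin.getnchannels() == 2 and fin.getsampwidth() == 2
    rate = fin.getframerate()
    samples = array.array("h", fin.readframes(fin.getnframes()))

# Samples are interleaved L, R, L, R, ...; average each pair.
mono = array.array("h", ((samples[i] + samples[i + 1]) // 2
                         for i in range(0, len(samples), 2)))

with wave.open("output_mono.wav", "wb") as fout:
    fout.setnchannels(1)
    fout.setsampwidth(2)
    fout.setframerate(rate)
    fout.writeframes(mono.tobytes())
</pre>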
 
 
 
Should we rename this the Style classification task?
 
----
 
4. Who has data?
 
Anyone with music (with or without ground truth) that we can use should make themselves known ASAP. I can provide a fair selection of white-label (IP-free) dance music in at least 3 subgenres, with labels defined by 3 expert listeners.
 