2006:Audio Music Similarity and Retrieval
This page is devoted to discussions of the evaluation of Audio Music Similarity algorithms at MIREX 2006. Discussions have already begun on the MIREX 06 "AudioSim06" contest planning list and will be briefly digested here. A full digest of the discussions is available to subscribers from the MIREX 06 "AudioSim06" contest planning list archives.
As consensus is achieved on the planning list, a full proposal (Audio Music Similarity proposal) will be produced for the format of the evaluation, including pseudocode for the evaluation metric and suggested formats for submitted algorithms. A skeleton of proposal is already available on the Audio Music Similarity proposal page.
- Kris West (University of East Anglia, UK) - email@example.com
- Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - firstname.lastname@example.org
- Paul Lamere (Sun Microsystems Laboratories, USA) - email@example.com
Although the automatic extraction of genre and artist labels from audio are interesting tasks, I (KW) believe that they are often used to evaluate more general music similarity techniques that compare two songs based on their audio content. These techniques are hard to evaluate directly, for example with listening tests, as it is not practical to have a human listener rank the similarities of even a small test collection for a number of queries, which might require many hours of listening. Therefore, We have begun discussion of other methods of evaluating music similarity techniques, such as the methods described in Logan & Saloman (A Music Similarity Function Based on Signal Analysis, ICME2001), where the most similar 5, 10 or 20 songs were retrieved and the average number of songs in the same genre, from the same artist and from the same album calculated. This evaluation could be extended to multiple genres if data is available. I believe it is also important that we evaluate other characteristics of these algorithms, such as the descriptor extraction time, query time and memory footprint (which may indicate the applicability of a technique to an application).
Important threads on the discussion list
Types of evaluation
There have been a number of papers describing similarity evaluation, including those by Whitman, Berenzweig, Ellis and Logan. The methods used generally fall into the following buckets:
- Subjective precision via user tests
- Expert opinion (similar artist lists from music editors like All Music Guide)
- Playlist Co-occurrence
- User Collection Co-occurrence
- objective statistics based upon album, artist and genre labels. (TopN, average distance)
For a standard, annual evaluation like MIREX, the first four types of evaluations seem problematic.
1 - subjective precision - is very expensive to collect this data for alarge music collection, and would likely be unreliable unless many users were evaluated.
2 - Expert opinion - expert opinion will usually rate similarity of artists but not songs. Also, not transitive, coldplay may sound like the Beatles, but no one ever says the beatles sound like coldplay. This data generally only exists for popular artists (i.e. not for artists typically found in 'free' collections of music).
3,4 - Playlist Co-occurrence, User Collection co-occurrence - works for popular music, but usually not enough coverage for less popular music, generally not suitable for our test collections (such as magnatune or epitonic), since this music is not listened to by enough people.
Further comments from Paul Lamere
Elias suggests using closest 20 from the same genre using an artist filter. We've used a similar metric for a similarity evaluator with some mixed results. We had one particular system that generated very tight genre clusters but within the clusters there was little similarity discrimination. This system scored very well with a top N test, but subjectively the results were very poor. For instance, with a seed song of a string quartet, the most similar songs produced by the system was a choral piece, a symphony and a piano piece. All were Classical pieces, but subjectively the songs were not similar. Also, it seems that a simple top 20 metric based solely on genre would result in the evaluation devolving into just another kind of genre classification evaluation. All that being said, I also understand the problems with overfitting on artist and album, if these are included in the evaluation metric ...
Opt-in survey of Audio music similarity researchers
In this section we would like to take a brief 'opt-in' survey of researchers actively working in this field. Please feel free to add yourself to the list (or email your details to the moderators listed above).