2006:Audio Music Similarity and Retrieval
- 1 Overview
- 2 Moderators
- 3 Introduction
- 4 Important threads on the discussion list
- 4.1 To Do
- 4.2 The importance of context in music similarity evaluation
- 4.3 Types of evaluation
- 4.4 Factors to evaluate
- 4.5 Objective statistics based upon album, artist and genre labels proposal
- 4.6 Subjective evaluation
- 5 Related Papers
- 6 Opt-in survey of Audio music similarity researchers
This page is devoted to discussions of the evaluation of Audio Music Similarity algorithms at MIREX 2006. Discussions have already begun on the MIREX 06 "AudioSim06" contest planning list and will be briefly digested here. A full digest of the discussions is available to subscribers from the MIREX 06 "AudioSim06" contest planning list archives.
As consensus is achieved on the planning list, a full proposal (Audio Music Similarity proposal) will be produced for the format of the evaluation, including pseudocode for the evaluation metric and suggested formats for submitted algorithms. A skeleton of the proposal is already available on the Audio Music Similarity proposal page.
- Kris West (University of East Anglia, UK) - email@example.com
- Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - firstname.lastname@example.org
- Paul Lamere (Sun Microsystems Laboratories, USA) - email@example.com
Although the automatic extraction of genre labels and of artist labels from audio are both interesting tasks, I (KW) believe that they are often used to evaluate more general music similarity techniques that compare two songs based on their audio content. These techniques are hard to evaluate directly, for example with listening tests, because it is not practical to have a human listener rank the similarities of even a small test collection for a number of queries; doing so might require many hours of listening. Therefore, we have begun discussing other methods of evaluating music similarity techniques, such as those described in Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME2001), where the most similar 5, 10 or 20 songs were retrieved and the average number of songs in the same genre, from the same artist and from the same album was calculated. This evaluation could be extended to multiple genres if data is available. I believe it is also important that we evaluate other characteristics of these algorithms, such as the descriptor extraction time, query time and memory footprint (which may indicate how applicable a technique is to a given application).
Important threads on the discussion list
- Kris West
- Paul Lamere
- Elias Pampalk
- Fabian Mörchen
- George Tzanetakis
- Dan Ellis
- Stephen Green
- Rebecca Fiebrink
- Mark Levy
- Hamish Allan
- Anders Meng
- Adam Lindsay
- Integrate Dan Ellis, George Tzanetakis and Hamish Allan's comments (re: data and contexts).
- provide better/shorter summaries
The importance of context in music similarity evaluation
Collection specific learning
Types of evaluation
There have been a number of papers describing similarity evaluation, including those by Whitman, Berenzweig, Ellis and Logan. The methods used generally fall into the following buckets:
- Subjective precision via user tests
- Expert opinion (similar artist lists from music editors like All Music Guide)
- Playlist Co-occurrence
- User Collection Co-occurrence
- Objective statistics based upon album, artist and genre labels (TopN, average distance)
For a standard, annual evaluation like MIREX, the first four types of evaluation seem problematic.
1 - Subjective precision - it is very expensive to collect this data for a large music collection, and the results would likely be unreliable unless many users took part.
2 - Expert opinion - experts usually rate the similarity of artists, not songs. Their judgments are also not symmetric: Coldplay may sound like the Beatles, but no one ever says the Beatles sound like Coldplay. This data generally exists only for popular artists (i.e. not for artists typically found in 'free' collections of music).
3,4 - Playlist co-occurrence, user collection co-occurrence - these work for popular music, but coverage is usually insufficient for less popular music. They are generally not suitable for our test collections (such as Magnatune or Epitonic), since this music is not listened to by enough people.
Factors to evaluate
- feature extraction time
- distance computation time
- memory consumption during feature extraction
- memory consumption during distance computation
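The four cost factors above could be measured with a small wrapper such as the following. This is only a sketch using Python's standard `time` and `tracemalloc` modules; `profile_call`, `extract_features` and `distance` are hypothetical names invented for illustration, not part of any submission format. Note that `tracemalloc` only sees allocations made through Python's allocator and slows execution, so a real evaluation would more likely measure process-level memory and time the run separately.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Hypothetical stand-ins for the two stages of a submitted algorithm.
def extract_features(signal):
    return [sum(signal) / len(signal), max(signal) - min(signal)]


def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


feats, t_extract, mem_extract = profile_call(extract_features, [0.1, 0.5, -0.2, 0.3])
dist_value, t_dist, mem_dist = profile_call(distance, feats, [0.0, 1.0])
print(f"extraction: {t_extract:.6f} s, peak {mem_extract} bytes")
print(f"distance:   {t_dist:.6f} s, peak {mem_dist} bytes")
```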
Objective statistics based upon genre (with artist filter), artist & album labels:
- closest 1 (ratio of pieces in same genre/artist/album as query)
- closest 5 (-"-)
- closest 10 (-"-)
- closest 20 (-"-)
- clustering performance (e.g. ratio of the average intra-genre distance to the average inter-genre distance, see below)
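As an illustration of the "closest N" statistics listed above, the sketch below computes the average fraction of each query's N nearest neighbours that share its label, with an optional artist filter that excludes the query's own artist from the candidates. The input format (a square distance matrix as nested lists, plus parallel label lists) and the name `top_n_agreement` are assumptions made for this example; the toy data is invented.

```python
def top_n_agreement(dist, labels, artists, n, artist_filter=False):
    """Average fraction of each query's n nearest neighbours sharing its label.

    dist: square distance matrix as a list of lists (assumed input format).
    labels: the label being scored (genre, artist or album), one per track.
    artists: artist per track, consulted only when artist_filter is True.
    """
    ratios = []
    for q in range(len(dist)):
        candidates = [
            i for i in range(len(dist))
            if i != q and not (artist_filter and artists[i] == artists[q])
        ]
        nearest = sorted(candidates, key=lambda i: dist[q][i])[:n]
        if nearest:  # a query with no eligible candidates is skipped
            hits = sum(labels[i] == labels[q] for i in nearest)
            ratios.append(hits / len(nearest))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Invented toy collection: two rock tracks by one artist, two jazz tracks.
genres = ["rock", "rock", "jazz", "jazz"]
artists = ["A", "A", "B", "C"]
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]

for n in (1, 2):
    plain = top_n_agreement(dist, genres, artists, n)
    filtered = top_n_agreement(dist, genres, artists, n, artist_filter=True)
    print(f"closest {n}: genre ratio {plain:.2f}, with artist filter {filtered:.2f}")
```

In the toy data the genre score drops once the artist filter is applied, because the rock queries can no longer match their own artist's other track.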
Objective statistics based upon album, artist and genre labels proposal
Use methods described in Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME2001), where the most similar 5, 10 or 20 songs are retrieved and the average number of songs in the same genre, from the same artist and from the same album are calculated.
An additional evaluation of the clustering performance of an algorithm could be calculated as the ratio of the average intra-genre distance (within-class scatter or cohesion) to the average inter-genre distance (between-class scatter or separation).
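A minimal sketch of this clustering ratio follows, assuming a symmetric distance matrix given as nested lists and averaging over unordered track pairs. The function name and the toy data are invented for illustration; lower values indicate tighter genre clusters relative to their separation.

```python
from itertools import combinations


def intra_inter_ratio(dist, labels):
    """Mean same-label pair distance divided by mean cross-label pair distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        (intra if labels[i] == labels[j] else inter).append(dist[i][j])
    if not intra or not inter:
        raise ValueError("need at least one same-label and one cross-label pair")
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


# Invented toy data: two rock tracks followed by two jazz tracks.
genres = ["rock", "rock", "jazz", "jazz"]
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
ratio = intra_inter_ratio(dist, genres)
print(f"intra/inter genre distance ratio: {ratio:.3f}")
```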
Justification for using artist and album labels
At Sun Labs we've been auditioning a number of different similarity models. We've had some models that behaved in ways that are apparently similar to your 'spectrum histogram', in that they yield good objective scores (a high percentage of songs in the top 20 are of the proper genre, and the average intra-genre distance is low compared to the overall average distance), but when using the models for actual playlist generation or visualization we'd experience similar 'space-time distortions'. 75% of the songs might be of the proper genre, but the other 25% would be way off. We call them 'clunkers': songs that no human would say belong in a playlist of similar songs. We'd also see related problems, where the songs in the 75% of the proper genre were not really very similar. A folk rock song would be 'near' a punk rock song, or a choral piece would be near a harpsichord piece.
One way of reducing this problem would be to take the artist and album metadata into account in the evaluation. Presumably there are three enclosing clusters: at the coarsest level is a genre cluster, within this cluster we could expect to find multiple artist clusters, and within an artist cluster we may find multiple album clusters. Now, if it turns out that the similarity classifier being evaluated really makes no distinction of nearness within the genre (e.g. choral and harpsichord music can be 'near'), then the ratio of the average artist distance to the average genre distance will approach one (the artist cluster is as large as the genre cluster). If this ratio is much less than one, then we have small artist clusters within the genre cluster: the harpsichord music has clustered together and separated from the choral music performed by a different artist. The album clustering can give another, finer-grained level (but I'm not convinced that this extra level is necessary).
To sum up then, the average artist distance compared to the average genre distance gives a notion of how well a similarity metric is working within a single genre.
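One possible reading of this ratio is sketched below: the mean distance between tracks by the same artist, divided by the mean distance between tracks in the same genre. Whether same-artist pairs should also count towards the genre average is not specified in the discussion; this sketch includes them, which is an assumption, as are the function names and the toy data.

```python
from itertools import combinations


def mean_same_label_distance(dist, labels):
    """Mean distance over all unordered track pairs that share a label."""
    values = [
        dist[i][j]
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    if not values:
        raise ValueError("no pair of tracks shares a label")
    return sum(values) / len(values)


def artist_to_genre_ratio(dist, artists, genres):
    """Near 1: artists are not separated within a genre. Well below 1: they are."""
    return mean_same_label_distance(dist, artists) / mean_same_label_distance(dist, genres)


# Invented toy data: one genre containing two artists with two tracks each.
genres = ["classical"] * 4
artists = ["harpsichordist", "harpsichordist", "choir", "choir"]
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
ratio = artist_to_genre_ratio(dist, artists, genres)
print(f"artist/genre distance ratio: {ratio:.3f}")
```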
Volatility of objective statistics and clustering metric
It has been suggested that the mean is too volatile for calculating the above statistics, because it is influenced by outliers, and that we should therefore use the median. Outliers easily sneak into any ground truth, or can be caused by choosing a bad segment of the song. However, it should be noted that if a participant's system selects a poor segment from a song, that reflects on the assumption that a representative sample can be selected at random from a piece of music (an intelligent segmentation or thumbnailing technique might have a significant advantage). If, on the other hand, a poor query segment is selected for the evaluation, it will affect all submissions equally. If a particular submission handles a poor query better, it should achieve a better evaluation score rather than have that additional performance averaged out. That is, if an algorithm doesn't produce as many outliers, that fact should be represented in the evaluation score.
The use of the median will most likely improve all calculated ratios, but will reduce the difference between algorithms. A more useful alternative may be the trimmed mean (remove 1 - 2% of results from both ends of each distribution, then calculate the mean). Another interesting statistic may be the difference between the mean and the median, as a lower value should indicate that the algorithm handles outliers well (is less volatile), while a higher value will indicate more outliers in the distribution.
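The three summaries discussed here can be compared side by side. The sketch below uses Python's standard `statistics` module; `trimmed_mean` is a hypothetical helper written for this example, and the sample of 98 typical values plus 2 outliers is invented.

```python
from statistics import mean, median


def trimmed_mean(values, fraction=0.02):
    """Mean after dropping `fraction` of the values from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


# Invented per-query scores: 98 typical values and 2 large outliers.
values = [1.0] * 98 + [50.0] * 2

print("mean:         ", mean(values))
print("median:       ", median(values))
print("trimmed mean: ", trimmed_mean(values, 0.02))
print("mean - median:", mean(values) - median(values))
```

With this sample the two outliers pull the mean well above the median, while the 2% trimmed mean discards them along with two typical values from the low end.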
Here's another possibility for evaluation. I notice that there are now around 60 people subscribed to this email list. I imagine that this will grow to at least 100 or so by the time the MIREX 2006 evaluation begins. With this many people, perhaps we can do some real human evaluations of submitted systems. If we can get each person on this list to agree to evaluate 10 playlists and rate them on a scale of 1 to 5 in terms of how similar the songs are, then we can get 1000 playlist ratings. If 50 systems are submitted, that means we can have 20 different people look at each playlist. Here's what we could do:
- Each system would be given a seed song and asked to generate a 'playlist' of the 10 most similar songs (all systems would receive the same seed song)
- The lists would be anonymized (song titles and artist names removed to prevent bias from the metadata)
- The playlists would be randomly distributed to the set of evaluators to be scored (a web interface would suffice).
- Evaluators assign a 1 to 5 rating to each of their ten assigned playlists
- The average score for each playlist would be recorded as part of the evaluation
- The resulting playlists can be made available for browsing after the evaluation results are published.
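The distribution and scoring steps above can be sketched as follows, using the figures from the proposal (100 evaluators, 10 playlists each, 50 playlists). The round-robin dealing scheme, the function names and the `(playlist_id, score)` rating format are all assumptions for illustration; the proposal itself only asks for random distribution.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def assign_playlists(num_evaluators, num_playlists, per_evaluator, seed=0):
    """Deal a shuffled playlist order round-robin to keep rating counts balanced."""
    if per_evaluator > num_playlists:
        raise ValueError("an evaluator cannot rate more playlists than exist")
    order = list(range(num_playlists))
    random.Random(seed).shuffle(order)
    assignments = {}
    cursor = 0
    for evaluator in range(num_evaluators):
        # Consecutive slots in a cycle, so no evaluator sees a playlist twice.
        assignments[evaluator] = [
            order[(cursor + k) % num_playlists] for k in range(per_evaluator)
        ]
        cursor += per_evaluator
    return assignments


def average_scores(ratings):
    """ratings: iterable of (playlist_id, score) pairs with scores from 1 to 5."""
    by_playlist = defaultdict(list)
    for playlist_id, score in ratings:
        by_playlist[playlist_id].append(score)
    return {pid: mean(scores) for pid, scores in by_playlist.items()}


assignments = assign_playlists(num_evaluators=100, num_playlists=50, per_evaluator=10)
per_playlist = Counter(p for plist in assignments.values() for p in plist)
print("ratings per playlist:", set(per_playlist.values()))
print(average_scores([(0, 4), (0, 2), (1, 5)]))
```

Because 100 x 10 = 1000 rating slots are dealt over a cycle of 50 playlists, each playlist receives exactly 20 ratings under this scheme.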
Since music similarity is such a human quantity, I think that it is important to have some user evaluation of similarity as part of MIREX. Thoughts?
Alternatives to rating on a scale
A pairwise comparison approach ("Is playlist A better or worse than playlist B?") might be more appropriate than rating on a scale (where, for example, individuals might prefer or avoid the middle of the scale by nature, or where all ten of someone's playlists could be in reality very bad compared to the set of all generated playlists, but he or she would have no basis to judge them as such). A one-bit measurement (relevant or irrelevant) has really paid off in text retrieval. I expect it would pay off in music retrieval as well.
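One simple way to turn pairwise judgments into a ranking is to compute each playlist's win rate, as sketched below. This aggregation rule is an assumption chosen for brevity and was not proposed on the list; the `(winner, loser)` input format and the playlist names are invented.

```python
from collections import Counter


def win_rates(judgments):
    """judgments: iterable of (winner, loser) pairs from 'A better than B' answers."""
    wins, appearances = Counter(), Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {p: wins[p] / appearances[p] for p in appearances}


# Invented judgments over three anonymized playlists.
judgments = [("A", "B"), ("A", "C"), ("B", "C")]
rates = win_rates(judgments)
for playlist, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(playlist, rate)
```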
A one bit measure of overall similarity may be tricky to come by. But there are plenty of one bit musical features that could be measured quickly and pretty consistently, and tracks with a sufficient proportion of these in common can reasonably be called similar. This is presumably how the folks at pandora.com manage to process a new track in 20 seconds. Could this approach work?
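The idea of calling two tracks similar when they share a sufficient proportion of one-bit features could be sketched like this. The feature vectors, the 0.75 threshold and the function names are hypothetical; nothing here reflects how pandora.com actually works.

```python
def shared_proportion(a, b):
    """Fraction of one-bit features on which two tracks agree."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_similar(a, b, threshold=0.75):
    """threshold is an arbitrary illustrative cut-off, not a recommended value."""
    return shared_proportion(a, b) >= threshold


# Invented one-bit features, e.g. (has vocals, is acoustic, has drums, major key).
track_a = (1, 0, 1, 1)
track_b = (1, 0, 0, 1)
track_c = (0, 1, 0, 0)

print(shared_proportion(track_a, track_b), is_similar(track_a, track_b))
print(shared_proportion(track_a, track_c), is_similar(track_a, track_c))
```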
- A Music Similarity Function Based On Signal Analysis
- A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures
- Toward Evaluation Techniques for Music Similarity
- The Quest for Ground Truth in Music Similarity
- Improving Timbre Similarity: How high is the sky?
- Music Similarity Metrics
- Improvements of Audio-based Music Similarity and Genre Classification
Opt-in survey of Audio music similarity researchers
In this section we would like to take a brief 'opt-in' survey of researchers actively working in this field. Please feel free to add yourself to the list (or email your details to the moderators listed above).