2006:Audio Music Similarity and Retrieval
- 1 Overview
- 2 Moderators
- 3 Introduction
- 4 Important threads on the discussion list
- 4.1 Context related issues in music similarity evaluation
- 4.2 Types of evaluation
- 4.3 Factors to evaluate
- 4.4 Objective statistics based upon album, artist and genre labels proposal
- 4.5 Subjective evaluation
- 5 Related Papers
- 6 Opt-in survey of Audio music similarity researchers
This page is devoted to discussions of the evaluation of Audio Music Similarity algorithms at MIREX 2006. Discussions have already begun on the MIREX 06 "AudioSim06" contest planning list and will be briefly digested here. A full digest of the discussions is available to subscribers from the MIREX 06 "AudioSim06" contest planning list archives.
As consensus is achieved on the planning list, a full proposal (Audio Music Similarity proposal) will be produced for the format of the evaluation, including pseudocode for the evaluation metric and suggested formats for submitted algorithms. A skeleton of the proposal is already available on the Audio Music Similarity proposal page.
- Kris West (University of East Anglia, UK) - firstname.lastname@example.org
- Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - email@example.com
- Paul Lamere (Sun Microsystems Laboratories, USA) - firstname.lastname@example.org
Although extracting genre labels and artist labels from audio are interesting tasks, I (KW) believe that they are often used to evaluate more general music similarity techniques that compare two songs based on their audio content. These techniques are hard to evaluate directly, for example with listening tests, as it is not practical to have a human listener rank the similarities of even a small test collection for a number of queries, which might require many hours of listening. Therefore, we have begun discussing other methods of evaluating music similarity techniques. One is the method described in Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME 2001), where the most similar 5, 10 or 20 songs were retrieved and the average number of songs in the same genre, from the same artist and from the same album was calculated. Another is more practical methods of subjective evaluation of similarity estimators (i.e. evaluation of performance, rather than comparison of output to that of human annotators). This evaluation could be extended to multiple genres if data is available. I believe it is also important that we evaluate other characteristics of these algorithms, such as the descriptor extraction time, query time and memory footprint (which may indicate the applicability of a technique to an application).
This page serves as a summary of the discussions held on the AudioSim06 mailing list and will eventually hold a final evaluation proposal for MIREX 2006.
Important threads on the discussion list
- Kris West
- Paul Lamere
- Elias Pampalk
- Fabian Mörchen
- George Tzanetakis
- Dan Ellis
- Stephen Green
- Rebecca Fiebrink
- Mark Levy
- Hamish Allan
- Anders Meng
- Adam Lindsay
1) Is the notion of music similarity consistent across different listeners, cultures, levels of music education, etc.? One thing I know for sure is that most pieces of folk music from the island of Crete would sound very similar to all of you, whereas to people from Crete (including me) each sounds unique.
2) Does it even make sense to speak of similarity as a one-dimensional quantity? For example, is the dance version of Carmina Burana more similar to a classical recording of Carmina Burana or to another dance piece using the same drum loop?
3) Can similarity be context-independent? Similarity only makes sense relative to a particular context. Billie Holiday is very different from Ella Fitzgerald in the context of female jazz singers, but the two might be perceived as very similar in a general context of female singers that includes Britney Spears and Ani DiFranco.
Each of us considers similarity in a very different manner. Consider the scenario of human ranking of playlists according to similarity. The ranking we would get out of this "non-guided" (flat prior) similarity evaluation would be a kind of average ranking, since each user ranks according to his or her own preferences: e.g. user 1 might rank by vocal similarity, while user 2 ranks by instrument similarity, and so on. Perhaps it would be beneficial for the end user of some fancy music retrieval system in the future either to find music based on a kind of "average" similarity (which I guess a lot of people would be happy with) or to be able to select his/her dimension of similarity, say: "Vocal similarity, like Bono....". Perhaps a multidimensional similarity evaluation will be possible next year, but it would almost certainly have to involve the generation of ground truth through subjective similarity judgements.
The issue of 'context-dependent' similarity is not so hard to deal with. If I give you one track and ask for the most similar ones, there's no context. But if I give you 20, the spread within the set defines the context, and the algorithms can try to infer it. So if those tracks all have Paul McCartney singing, or if they all use cellos, or if they all use a simple tonic-subdominant-dominant chord progression, it seems a well-formed problem to ask an algorithm to infer the correct aspect on which to perform matching.
Predicting human-generated playlists is one possible way to frame this. By trawling the web, or scraping people's iTunes databases, you can get a lot of sets of songs that people have lumped together for one reason or another, representative of the real spread of 'contexts' relevant to users. Having algorithms attempt to complete the rest of a playlist given the 1st half is something you can measure, even if performance is bound to be low in absolute terms. I'm not frightened of low absolute performance provided there is still measurable difference between different systems.
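The playlist-completion idea above can be made concrete with a small sketch. Everything here is an assumption made for illustration: the names `complete_playlist` and `completion_recall`, the choice to rank candidates by mean distance to the seed tracks, and the toy one-dimensional "features". A real submission would supply its own distance function.

```python
def complete_playlist(seed_tracks, candidates, distance, k):
    """Return the k candidates with the smallest mean distance to the seeds."""
    def mean_distance(track):
        return sum(distance(track, s) for s in seed_tracks) / len(seed_tracks)
    return sorted(candidates, key=mean_distance)[:k]


def completion_recall(playlist, collection, distance):
    """Hide the second half of a playlist and measure how much of it is recovered."""
    half = len(playlist) // 2
    seeds, held_out = playlist[:half], playlist[half:]
    candidates = [t for t in collection if t not in seeds]
    predicted = complete_playlist(seeds, candidates, distance, k=len(held_out))
    return len(set(predicted) & set(held_out)) / len(held_out)


# Toy collection: each track is described by one invented feature value.
features = {"a": 0.0, "b": 0.1, "c": 0.2, "d": 0.3, "x": 5.0, "y": 5.1}

def distance(p, q):
    return abs(features[p] - features[q])

recall = completion_recall(["a", "b", "c", "d"], list(features), distance)
print(recall)
```

Averaging this recall over many human-generated playlists would give one number per system, which is all that is needed to compare systems even when absolute performance is low.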
Another observation worth making is that a relatively small dataset (e.g. ~5000 tracks), where both data and metadata come from a common source (e.g. a single record label), defines its own limited context as you can reasonably expect that the genre classifications (used to organise the collection) have been applied in a consistent manner without outliers. The likelihood of this being the case decreases as the collection size rises, as it would require more editors and more judgements to organise.
Collection specific learning
It appears that music similarity estimators can be roughly divided into three groups, based on the types of data that they leverage: purely content-based, augmented with behavioural data (such as skipping behaviour, playlist co-occurrence etc.), and trained (content-based, with collection-specific parameters estimated from a labelled subset of the database or an independent database). Purely content-based estimators appear to be the default mode for this evaluation. However, evaluation of trained submissions should be possible. Assuming an evaluation database size of 4000+ examples, 1000 - 1500 examples could be held back for training. Examples should be selected from the database in the same proportions in which they occur. It might also be interesting to evaluate two copies of trained algorithms, one trained on the subset from the database and another trained on a separate dataset (greater differences in performance based on the two training sets may indicate overfitting to the collection, while smaller differences may indicate better generalisation).
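Holding back a training subset "in the same proportions" can be read as a stratified split. The sketch below assumes that the proportions refer to genre labels, which the text does not state explicitly; the function name `stratified_holdout` and the toy genre counts are invented for illustration.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, holdout_fraction, seed=0):
    """Split track indices into (evaluation, training), preserving label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    training, evaluation = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * holdout_fraction)
        training.extend(indices[:n_train])
        evaluation.extend(indices[n_train:])
    return sorted(evaluation), sorted(training)


# Toy database of 4000 tracks with invented genre counts.
labels = ["rock"] * 2000 + ["jazz"] * 1200 + ["classical"] * 800
evaluation, training = stratified_holdout(labels, holdout_fraction=0.3)
print(len(evaluation), len(training))
```

With a 30% holdout this keeps 1200 tracks for training, inside the 1000 - 1500 range suggested above, and each genre contributes the same share to both subsets.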
Types of evaluation
There have been a number of papers describing similarity evaluation, including those by Whitman, Berenzweig, Ellis and Logan. The methods used generally fall into the following buckets:
- Subjective precision via user tests
- Expert opinion (similar artist lists from music editors like All Music Guide)
- Playlist Co-occurrence
- User Collection Co-occurrence
- Objective statistics based upon album, artist and genre labels (TopN, average distance)
For a standard, annual evaluation like MIREX, the first four types of evaluations seem problematic.
1 - Subjective precision - this data is very expensive to collect for a large music collection, and would likely be unreliable unless many users were evaluated.
2 - Expert opinion - expert opinion will usually rate similarity of artists but not songs. It is also not symmetric: Coldplay may sound like the Beatles, but no one ever says the Beatles sound like Coldplay. This data generally only exists for popular artists (i.e. not for artists typically found in 'free' collections of music).
3,4 - Playlist co-occurrence, user collection co-occurrence - these work for popular music, but there is usually not enough coverage for less popular music. They are generally not suitable for our test collections (such as Magnatune or Epitonic), since this music is not listened to by enough people.
Factors to evaluate
- feature extraction time
- distance computation time
- memory consumption during feature extraction
- memory consumption during distance computation
objective statistics based upon genre (with artist filter), artist & album labels:
- closest 1 (ratio of pieces in same genre/artist/album as query)
- closest 5 (-"-)
- closest 10 (-"-)
- closest 20 (-"-)
- clustering performance (e.g. ratio of the average intra-genre distance to the average inter-genre distance, see below)
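The time and memory factors listed above could be measured with a small harness. The following is a minimal sketch using Python's standard `time.perf_counter` and `tracemalloc`; `profile` and `toy_extract_features` are hypothetical names, and the toy extractor only stands in for a submitted algorithm. Note that `tracemalloc` only sees memory allocated through Python's allocator and slows execution, so a real evaluation of native code would likely need OS-level measurement instead.

```python
import time
import tracemalloc


def profile(func, *args):
    """Run func once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    tracemalloc.stop()
    return result, elapsed, peak


def toy_extract_features(samples, frame=256):
    """Hypothetical stand-in for feature extraction: one mean value per frame."""
    return [sum(samples[i:i + frame]) / frame for i in range(0, len(samples), frame)]


samples = [0.0] * 2560
descriptor, seconds, peak_bytes = profile(toy_extract_features, samples)
print(len(descriptor), seconds, peak_bytes)
```

The same wrapper could be applied separately to the distance computation, giving the four time and memory figures in the list.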
Objective statistics based upon album, artist and genre labels proposal
Use methods described in Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME2001), where the most similar 5, 10 or 20 songs are retrieved and the average number of songs in the same genre, from the same artist and from the same album are calculated.
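As a rough illustration of these top-N statistics, the sketch below computes, for every query, the fraction of its N nearest tracks that share the query's label, and averages over all queries. The function name `top_n_agreement`, the optional artist filter argument and the toy data are assumptions made for illustration, not part of the agreed proposal. Dividing by `n` assumes every query has at least `n` candidates.

```python
def top_n_agreement(dist, labels, n, artists=None):
    """Mean fraction of each query's n nearest tracks that share its label.

    dist is a square distance matrix. If artists is given, tracks by the
    query's own artist are removed from the candidates (artist filter).
    """
    total = 0.0
    for q in range(len(dist)):
        candidates = [
            t for t in range(len(dist))
            if t != q and (artists is None or artists[t] != artists[q])
        ]
        nearest = sorted(candidates, key=lambda t: dist[q][t])[:n]
        total += sum(labels[t] == labels[q] for t in nearest) / n
    return total / len(dist)


# Toy collection: one invented feature value per track.
features = [0.0, 0.1, 0.9, 1.0, 1.05, 0.25]
genres = ["rock", "rock", "rock", "jazz", "jazz", "jazz"]
artists = ["A", "A", "B", "C", "C", "D"]
dist = [[abs(a - b) for b in features] for a in features]

genre_top1 = top_n_agreement(dist, genres, n=1)
genre_top1_filtered = top_n_agreement(dist, genres, n=1, artists=artists)
artist_top1 = top_n_agreement(dist, artists, n=1)
print(genre_top1, genre_top1_filtered, artist_top1)
```

In this toy collection the unfiltered genre score is 4/6, while the artist filter drops it to 0, because every correct genre match came from the query's own artist. This is the effect the artist filter is meant to expose.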
An additional evaluation of the clustering performance of an algorithm could be calculated as the ratio of the average intra-genre distance (within-class scatter or cohesion) to the average inter-genre distance (between-class scatter or separation).
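One possible reading of this ratio is sketched below: average the distance over all track pairs with the same genre, average it over all pairs with different genres, and divide. The function name `intra_inter_ratio` and the toy data are invented for illustration, and other definitions (for example per-genre averages) would be equally consistent with the text.

```python
from itertools import combinations


def intra_inter_ratio(dist, labels):
    """Ratio of mean same-label pair distance to mean different-label pair distance."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        (intra if labels[i] == labels[j] else inter).append(dist[i][j])
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


# Toy collection: one invented feature value per track.
features = [0.0, 0.2, 1.0, 1.2]
genres = ["rock", "rock", "jazz", "jazz"]
dist = [[abs(a - b) for b in features] for a in features]

ratio = intra_inter_ratio(dist, genres)
print(ratio)
```

A value well below one means genres form tight, well-separated clusters; a value near one means the distance measure barely distinguishes genres.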
Justification for using artist and album labels
At Sun Labs we've been auditioning a number of different similarity models. We've had some models that behaved in ways that are apparently similar to your 'spectrum histogram', in that they yield good objective scores (a high percentage of songs in the top 20 are of the proper genre, the average intra-genre distance is low compared to the overall average distance), but when using the models for actual playlist generation or visualization we'd experience similar 'space-time distortions'. 75% of the songs might be of the proper genre, but the other 25% would be way off. We call them 'clunkers': songs that no human would say belong in the playlist of similar songs. Also, we'd see other similar problems, where the songs in the 75% of the proper genre were not really very similar. A folk rock song would be 'near' a punk rock song, or a choral piece would be near a harpsichord piece. One way of reducing this problem would be to take into account the artist and album metadata in the evaluation. Presumably there are three enclosing clusters: at the coarsest level is a genre cluster, within this cluster we could expect to find multiple artist clusters, and within an artist cluster we may find multiple album clusters. Now, if it turns out that the similarity classifier being evaluated really makes no distinction of nearness within the genre (e.g. choral and harpsichord music can be 'near'), then the ratio of the average artist distance to the average genre distance will approach one (the artist cluster is as large as the genre cluster), but if this ratio is much less than one then we have small artist clusters within the genre cluster: the harpsichord music has clustered together and separated from the choral music performed by a different artist. The album clustering can give another, finer-grained level (but I'm not convinced that this extra level is necessary).
To sum up then, the average artist distance compared to the average genre distance gives a notion of how well a similarity metric is working within a single genre.
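The artist-to-genre ratio can be sketched in the same way. Here "average artist distance" is assumed to mean the mean distance over pairs of tracks by the same artist, and "average genre distance" the mean over pairs of tracks in the same genre (same-artist pairs included); the text leaves both definitions open. The name `mean_within` and the toy harpsichord and choral data are invented.

```python
from itertools import combinations


def mean_within(dist, labels):
    """Mean distance over all pairs of tracks that share a label."""
    pairs = [
        dist[i][j]
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    return sum(pairs) / len(pairs)


# One genre containing two artists whose tracks sit far apart.
features = [0.0, 0.1, 1.0, 1.1]
genres = ["classical"] * 4
artists = ["harpsichordist", "harpsichordist", "choir", "choir"]
dist = [[abs(a - b) for b in features] for a in features]

artist_to_genre = mean_within(dist, artists) / mean_within(dist, genres)
print(artist_to_genre)
```

Here the ratio is about 0.14, so the artist clusters are much smaller than the genre cluster, which is the behaviour the paragraph above describes as desirable.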
Volatility of objective statistics and clustering metric
It has been suggested that using the mean in the calculation of the above statistics is too volatile, as it is influenced by outliers, and that we should therefore use the median. Outliers easily sneak into any ground truth or can be caused by choosing a bad segment of the song. However, if a participant's system selects a poor segment from a song, that reflects on its assumption that a representative sample can be selected at random from a piece of music (an intelligent segmentation or thumbnailing technique might have a significant advantage). If instead a poor query segment is selected for the evaluation, it will affect all submissions equally. If a particular submission handles a poor query better, it should achieve a better evaluation score, rather than have that additional performance averaged out. I.e. if an algorithm doesn't produce as many outliers, that fact should be represented in the evaluation score.
The use of the median will most likely improve all calculated ratios, but will reduce the difference between algorithms. A more useful alternative may be the trimmed mean (remove 1 - 2% of results from both ends of each distribution, then calculate the mean). It has also been commented that, in generated playlists, outliers can ruin the perception of the performance of the rest of the generated playlist, and therefore the use of a metric which deliberately ignores outliers is highly inappropriate (this applies to both the median and the trimmed mean).
Another interesting statistic may be the difference between the mean and the median, as a lower value should indicate that the algorithm handles outliers well (is less volatile), while a higher value will indicate more outliers in the distribution.
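The three candidate averages and the mean-median gap can be compared directly. This sketch uses Python's standard `statistics` module; the helper `trimmed_mean`, the 2% trim and the invented per-query scores (98 good queries and 2 complete failures) are assumptions for illustration only.

```python
import statistics


def trimmed_mean(values, trim_percent=2):
    """Mean after dropping trim_percent % of the values from each end."""
    ordered = sorted(values)
    k = len(ordered) * trim_percent // 100
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


# Invented per-query scores: most queries do well, two fail outright.
scores = [0.8] * 98 + [0.0, 0.0]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
trimmed_score = trimmed_mean(scores)
volatility = abs(mean_score - median_score)
print(mean_score, median_score, trimmed_score, volatility)
```

The median and the trimmed mean both hide the two failed queries, while the mean and the mean-median gap still reflect them, which is the trade-off discussed above.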
Here's another possibility for evaluation. I notice that there are now around 60 people subscribed to this email list. I imagine that this will grow to at least 100 or so by the time the MIREX 2006 evaluation begins. With this many people, perhaps we can do some real human evaluations of submitted systems. If we can get each person on this list to agree to evaluate 10 playlists and rate them on a scale of 1 to 5 in terms of how similar the songs are, then we can get 1000 playlist evaluations. If 50 systems are submitted, that means we can have 20 different people look at each playlist. Here's what we could do:
- Each system would be given a seed song and asked to generate a 'playlist' of the 10 most similar songs (all systems would receive the same seed song)
- The lists would be anonymized (song titles and artist names removed to prevent bias from the metadata)
- The playlists would be randomly distributed to the set of evaluators to be scored (a web interface would suffice).
- Evaluators assign a 1 to 5 rating to each of their ten assigned playlists
- The average score for each playlist would be recorded as part of the evaluation
- The resulting playlists can be made available for browsing after the evaluation results are published.
Since music similarity is such a human quantity I think that it is important to have some user evaluation of similarity as part of MIREX. Thoughts?
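The arithmetic in this proposal (100 evaluators, 10 playlists each, 50 systems, 20 views per playlist) can be checked with a short sketch. It assumes one seed song, so one playlist per system, and uses a shuffled round-robin so that every playlist is seen equally often; the proposal itself only says "randomly distributed". The names `assign_playlists` and `mean_scores` are invented, and anonymization and the web interface are not modelled.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def assign_playlists(n_evaluators, n_systems, per_evaluator, seed=0):
    """Give each evaluator distinct playlists so that all are seen equally often."""
    if per_evaluator > n_systems:
        raise ValueError("an evaluator cannot receive more playlists than exist")
    rng = random.Random(seed)
    order = list(range(n_systems))
    rng.shuffle(order)
    return [
        [order[(e * per_evaluator + i) % n_systems] for i in range(per_evaluator)]
        for e in range(n_evaluators)
    ]


def mean_scores(ratings):
    """Average the 1 to 5 ratings per system; ratings are (system_id, score) pairs."""
    by_system = defaultdict(list)
    for system, score in ratings:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


assignments = assign_playlists(n_evaluators=100, n_systems=50, per_evaluator=10)
views = Counter(s for playlist_ids in assignments for s in playlist_ids)
print(set(views.values()))
```

With more than one seed song the number of views per playlist would drop proportionally, so the number of seeds would have to be chosen with the evaluator pool in mind.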
Alternatives to rating on a scale
A pairwise comparison approach ("Is playlist A better or worse than playlist B?") might be more appropriate than rating on a scale (where, for example, individuals might prefer or avoid the middle of the scale by nature, or where all ten of someone's playlists could be in reality very bad compared to the set of all generated playlists, but he or she would have no basis to judge them as such). A one-bit measurement (relevant or irrelevant) has really paid off in text retrieval. I expect it would pay off in music retrieval as well.
A one bit measure of overall similarity may be tricky to come by. But there are plenty of one bit musical features that could be measured quickly and pretty consistently, and tracks with a sufficient proportion of these in common can reasonably be called similar. This is presumably how the folks at pandora.com manage to process a new track in 20 seconds. Could this approach work?
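To make the one-bit-feature idea concrete, the sketch below scores two tracks by the fraction of binary features on which they agree (the simple matching coefficient). The feature names, the example tracks and the 0.75 threshold are all invented for illustration; nothing here describes how pandora.com actually works.

```python
# Hypothetical one-bit features, in a fixed order.
FEATURE_NAMES = ["has_vocals", "distorted_guitar", "acoustic", "four_on_the_floor"]


def shared_feature_fraction(a, b):
    """Fraction of one-bit features on which two tracks agree."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_similar(a, b, threshold=0.75):
    """Call two tracks similar when enough one-bit features agree."""
    return shared_feature_fraction(a, b) >= threshold


folk_song = [1, 0, 1, 0]
punk_song = [1, 1, 0, 0]
other_folk_song = [1, 0, 1, 0]

print(shared_feature_fraction(folk_song, punk_song))
print(is_similar(folk_song, other_folk_song))
```

Each feature is quick to judge on its own, and the overall similarity decision falls out of counting agreements rather than asking a listener for a single global judgement.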
- Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis.
One of the first papers on this topic. Reports a small-scale listening test (2 users) in which items in a playlist are rated as similar or not similar to the query song. In addition, automatic evaluation is reported: percentage of top 5, 10, 20 most similar songs in the same genre/artist/album as the query. A must read!
- Ellis, Whitman, Berenzweig, and Lawrence (ISMIR 2002), The Quest for Ground Truth in Music Similarity.
The MusicSeer survey is reported. (MusicSeer was a very clever way to get lots of users to rate artists by similarity.)
- Berenzweig, Logan, Ellis, and Whitman (ISMIR 2003), A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures.
Artist similarity measures are evaluated based on data from All Music Guide, from a survey (musicseer.com), and from playlists and personal collections.
- Logan, Ellis, and Berenzweig (SIGIR 2003), Toward Evaluation Techniques for Music Similarity.
Evaluating artist similarity (similar to the ISMIR 2003 version).
- Aucouturier and Pachet (JNRSAS 2004), Timbre Similarity: How high is the sky?.
Follow up to their ISMIR 2002 paper. Contains detailed results of experiments on the optimization of spectral similarity. Reports a glass ceiling. Excellent article!
- Pampalk, Flexer, and Widmer (ISMIR 2005), Improvements of Audio-based Music Similarity and Genre Classification.
The need for an artist filter (i.e., not having the same artists in the test and training set) is described in this paper.
- Vignoli and Pauws (ISMIR 2005), A Music Retrieval System Based on User-Driven Similarity and its Evaluation.
User evaluation based on a playlist generation system (which partly uses audio-based similarity).
Opt-in survey of Audio music similarity researchers
In this section we would like to take a brief 'opt-in' survey of researchers actively working in this field. Please feel free to add yourself to the list (or email your details to the moderators listed above).