2006:Audio Music Similarity and Retrieval

1 Overview
2 Moderators
3 Introduction
4 Evaluation Summary
5 The Evaluation Database
6 Evaluation Methodology
7 Important threads on the discussion list
8 Related Papers
9 Opt-in survey of Audio music similarity researchers

Overview

This page is devoted to discussions of the evaluation of Audio Music Similarity algorithms at MIREX 2006. Discussions have already begun on the MIREX 06 "AudioSim06" contest planning list and will be briefly digested here. A full digest of the discussions is available to subscribers from the MIREX 06 "AudioSim06" contest planning list archives.

As consensus is achieved on the planning list, a full proposal (Audio Music Similarity proposal) will be produced for the format of the evaluation, including pseudocode for the evaluation metric and suggested formats for submitted algorithms. A skeleton of the proposal is already available on the Audio Music Similarity proposal page.

Moderators

Kris West (University of East Anglia, UK) - kw@cmp.uea.ac.uk
Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - elias.pampalk@gmail.com
Paul Lamere (Sun Microsystems Laboratories, USA) - paul.lamere@sun.com

Introduction

Although the automatic extraction of genre labels and of artist labels from audio are interesting tasks, I (KW) believe that they are often used to evaluate more general music similarity techniques that compare two songs based on their audio content. These techniques are hard to evaluate directly, for example with listening tests, because it is not practical to have a human listener rank the similarities of even a small test collection for a number of queries, which might require many hours of listening. We have therefore begun discussing other methods of evaluating music similarity techniques. One is the method described in Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME 2001), where the most similar 5, 10 or 20 songs are retrieved and the average number of songs in the same genre, from the same artist and from the same album is calculated. Another is to find more practical methods of subjective evaluation of similarity estimators (i.e. evaluation of performance, rather than comparison of output to that of human annotators). This evaluation could be extended to multiple genres if data is available. I believe it is also important that we evaluate other characteristics of these algorithms, such as the descriptor extraction time, query time and memory footprint (which may indicate the applicability of a technique to an application).

This page serves as a summary of the discussions held on the AudioSim06 mailing list and will eventually hold a final evaluation proposal for MIREX 2006.

Evaluation Summary

We will be soliciting contributions to two distinct tracks: Audio Music Search and Cover Song Search
The division between these two tracks will be emphasized in the evaluation results, although results will be directly comparable (all evaluations will be performed for both tracks).
The intention of the Audio Music Search track is to evaluate music similarity searches (a music search engine that takes a single song as a query), not playlist generation or music recommendation.
Any criteria may be used to implement the search although we are not really considering socio-cultural context or lyrics.
Any models, codebooks etc. *must* be trained in advance. No collection specific optimisations should be used.
Please avoid use of the USPOP collection in your training (sorry if this is a bit harsh on you Dan) as it will form part of the test database. Please also avoid any other overlap with the test data that you can identify.

The Evaluation Database

The specifications of the evaluation database will be as follows:

22.05kHz, mono, 16bit, WAV files
The WAV files will be decoded from 192kbit Variable-bit-rate, stereo, 44.1kHz, MP3 files, produced with the lame codec
Will contain ~5000 tracks
Selected from both the USPOP and USCRAP collections
No tracks shorter than 30 seconds
No tracks longer than 10 minutes
A maximum of 20 tracks per artist
A minimum of 50 tracks per labelled genre
Will contain the ~350 songs from the IMIRSEL cover songs collections (30 distinct pieces - ~10-12 versions of each)
Cover songs, USPOP and USCRAP files will all be handled in exactly the same way (archival quality copy > 192k VBR MP3 > 22.05kHz WAV)
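A participant who wants to prepare local test data in the same shape could check files against the specification above. The following is only a minimal sketch using Python's standard `wave` module; the function name `check_wav` and the constant names are made up for illustration and are not part of any MIREX tooling.

```python
import wave

# Limits taken from the database specification above.
EXPECTED_RATE_HZ = 22050
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit
MIN_SECONDS = 30
MAX_SECONDS = 600  # 10 minutes


def check_wav(path):
    """Return a list of problems; an empty list means the file conforms."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        seconds = wav.getnframes() / rate
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ}")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"{8 * width} bit samples, expected 16 bit")
    if seconds < MIN_SECONDS:
        problems.append(f"{seconds:.1f} s is shorter than {MIN_SECONDS} s")
    if seconds > MAX_SECONDS:
        problems.append(f"{seconds:.1f} s is longer than {MAX_SECONDS} s")
    return problems
```

The per-artist and per-genre limits depend on metadata rather than on the audio files, so they are not covered here.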

Evaluation Methodology

Three distinct evaluations will be performed:

Human Evaluation
Evaluation according to cover song matches
Objective statistics derived from the distance matrix

Human Evaluation

The primary evaluation will involve subjective judgments by human evaluators of the retrieved sets using Stephen Downie's Evalutron 6000.

Evaluator question: Given a search based on track A, which of these two tracks (B or C) is a better result? (Note: there is still some question as to whether binary relative comparisons are a viable approach when the number of comparisons required is considered.)
~40 randomly selected queries, 5 results per query, 3 sets of eyes, ~10 participating labs
Higher number of queries preferred as IR research indicates variance is in queries
The cover songs and songs by the same artist as the query will be filtered out of each result list to avoid colouring an evaluator's judgement (a cover song or song by the same artist in a result list is likely to reduce the relative ranking of other similar but independent songs - use of songs by the same artist may allow over-fitting to affect the results)
These numbers can change as we are extracting the full distance matrices
It might be possible for researchers to use this data for other types of system comparisons after MIREX 2006 results have been finalized.
Human evaluation to be designed and led by IMIRSEL
Human evaluators will be drawn from the participating labs (and any volunteers from IMIRSEL or on the MIREX lists)
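To give a feel for the concern about the number of comparisons, here is a rough worked count using the approximate figures above. It assumes one submitted system per lab, no overlap between the systems' result lists, and exhaustive pairing of candidates for the B-versus-C question; all three are simplifying assumptions, and the real totals would be lower wherever result lists overlap or pairs are sampled.

```python
from math import comb

queries = 40             # "~40 randomly selected queries"
results_per_query = 5    # "5 results per query"
labs = 10                # "~10 participating labs", assumed one system each
evaluators_per_item = 3  # "3 sets of eyes"

# Worst case: no two systems return the same track for a query.
candidates_per_query = results_per_query * labs

# One judgement per (query, candidate) pair, e.g. a graded score.
absolute_judgements = queries * candidates_per_query * evaluators_per_item

# One judgement per unordered pair of candidates (B vs. C) for each query.
pairwise_judgements = (
    queries * comb(candidates_per_query, 2) * evaluators_per_item
)

print(absolute_judgements)  # 6000
print(pairwise_judgements)  # 147000
```

Under these assumptions exhaustive pairwise comparison needs roughly 25 times as many judgements as scoring each candidate once.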

Evaluation According to Cover song matches

 ( need more detail here )

Objective Statistics derived from the distance matrix

Statistics of each distance matrix will be calculated including:

Average % of Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Precision at 5, 10, 20 & 50
Average % of Genre matches in the top 5, 10, 20 & 50 results after artist filtering of results
Average % of available Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Recall at 5, 10, 20 & 50 (just normalising scores when fewer than 20 matches for an artist, album or genre are available in the database)
Normalised average distance between examples of the same Genre, Artist or Album
Always similar - Maximum # times a file was in the top 5, 10, 20 & 50 results
% of files never similar (never in a top 5, 10, 20 & 50 result list)
% of song triplets where the triangle inequality holds
Plot of the "number of times similar" curve - plot of song number vs. the number of times it appeared in a top 20 list, with songs sorted by that count (to produce the curve). Systems with a sharp rise at the end of this plot have "hubs", while a long 'zero' tail shows many never-similar results.
Ratio of the average artist distance to the average genre distance
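Several of the statistics listed above can be sketched directly from a distance matrix. The code below is an illustration, not the evaluation code: it assumes the matrix is a square list of lists with `dist[i][j]` the distance from track `i` to track `j`, and the function names and the toy data are made up.

```python
from collections import Counter
from itertools import combinations


def top_k(dist, query, k, artists=None):
    """Indices of the k nearest tracks to `query`, excluding the query.

    If `artists` is given, tracks by the query's artist are filtered out.
    """
    candidates = [
        j for j in range(len(dist))
        if j != query and (artists is None or artists[j] != artists[query])
    ]
    candidates.sort(key=lambda j: dist[query][j])
    return candidates[:k]


def precision_at_k(dist, labels, k, artists=None):
    """Average fraction of the top-k results sharing the query's label."""
    scores = []
    for q in range(len(dist)):
        hits = top_k(dist, q, k, artists)
        if hits:
            matches = sum(labels[j] == labels[q] for j in hits)
            scores.append(matches / len(hits))
    return sum(scores) / len(scores)


def times_similar(dist, k):
    """How often each track appears in another track's top-k list."""
    counts = Counter()
    for q in range(len(dist)):
        counts.update(top_k(dist, q, k))
    return [counts[j] for j in range(len(dist))]


def triangle_inequality_rate(dist):
    """Fraction of unordered triplets satisfying the triangle inequality."""
    ok = total = 0
    for a, b, c in combinations(range(len(dist)), 3):
        total += 1
        ok += (
            dist[a][b] <= dist[a][c] + dist[c][b]
            and dist[a][c] <= dist[a][b] + dist[b][c]
            and dist[b][c] <= dist[b][a] + dist[a][c]
        )
    return ok / total


# Toy data: two rock tracks by one artist, two jazz tracks by two artists.
dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 5],
    [4, 4, 0, 1],
    [5, 5, 1, 0],
]
genres = ["rock", "rock", "jazz", "jazz"]
artists = ["a", "a", "b", "c"]

print(precision_at_k(dist, genres, 1))           # 1.0
print(precision_at_k(dist, genres, 1, artists))  # 0.5
print(times_similar(dist, 1))                    # [1, 1, 1, 1]
print(triangle_inequality_rate(dist))            # 1.0
```

The drop from 1.0 to 0.5 once the artist filter is applied shows how same-artist matches can inflate genre precision. "Always similar" is the maximum of `times_similar`, and "never similar" is the share of zeros in it. For ~5000 tracks the number of triplets is far too large to enumerate, so the triangle inequality test would presumably have to be run on a random sample.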

Important threads on the discussion list

Contributors:

Kris West
Paul Lamere
Elias Pampalk
Fabian Mörchen
George Tzanetakis
Dan Ellis
Stephen Green
Rebecca Fiebrink
Mark Levy
Hamish Allan
Anders Meng
Adam Lindsay

Context related issues in music similarity evaluation

Viewpoints

George Tzanetakis

1) Is the notion of music similarity consistent between different humans/cultures/music education etc ? One thing I know for sure is that to all of you most pieces of folk music from the island of Crete would sound very similar whereas to people from Crete (including me) they sound completely unique.

2) Does it even make sense to speak of similarity as a one dimensional quantity ? For example is the dance version of Carmina Burana more similar to a classical recording of Carmina Burana or to another dance piece using the same drum loop.

3) Can similarity be context-independent ? Similarity only makes sense relative to a particular context. Billie Holiday is very different from Ella Fitzgerald in a context of female jazz singers, but might be perceived as very similar in a general context of female singers including Britney Spears and Ani DiFranco.

Anders Meng

Each of us considers similarity in a very different manner. Consider the scenario of human ranking of playlists according to similarity. The ranking we would get out of this "non-guided" (flat prior) similarity evaluation would be a kind of average ranking, since each user ranks according to his or her preferences: e.g. user 1 might rank by vocal similarity, while user 2 ranks by instrument similarity, and so on. Perhaps it would be beneficial for the end user of some fancy music retrieval system in the future either to find music based on a kind of "average" similarity (which I guess a lot of people would be happy with) or perhaps to be able to select his/her dimension of similarity, say: "Vocal similarity, like Bono....". Perhaps a multidimensional similarity evaluation will be possible next year, but it would almost certainly have to involve the generation of ground truth through subjective similarity judgements.

Elias Pampalk

perceived similarity is:

subjective
context dependent
multi-dimensional

I don't think anyone disagrees :-)

yet, I have no doubts, that by comparing the performances of algorithms which compute:

"objective"
context independent,
one-dimensional,

similarity ratings, we could benefit a lot.

such algorithms are something very practical. e.g. MTG recently announced that they were making money with such a technology as part of their "music surfer" product.

I'd really like to argue for keeping things as simple as possible. It's the first year we are trying this. I doubt that we could possibly keep it too simple, but if we do we could easily improve things next time.

Dan Ellis

The issue of 'context-dependent' similarity is not so hard to deal with. If I give you one track and ask for the most similar ones, there's no context. But if I give you 20, the spread within the set defines the context, and the algorithms can try to infer it. So if those tracks all have Paul McCartney singing, or if they all use cellos, or if they all use a simple tonic-subdominant-dominant chord progression, it seems a well-formed problem to ask an algorithm to infer the correct aspect on which to perform matching.

Predicting human-generated playlists is one possible way to frame this. By trawling the web, or scraping people's iTunes databases, you can get a lot of sets of songs that people have lumped together for one reason or another, representative of the real spread of 'contexts' relevant to users. Having algorithms attempt to complete the rest of a playlist given the 1st half is something you can measure, even if performance is bound to be low in absolute terms. I'm not frightened of low absolute performance provided there is still measurable difference between different systems.

Kris West

Another observation worth making is that a relatively small dataset (e.g. ~5000 tracks), where both data and metadata come from a common source (e.g. a single record label), defines its own limited context as you can reasonably expect that the genre classifications (used to organise the collection) have been applied in a consistent manner without outliers. The likelihood of this being the case decreases as the collection size rises, as it would require more editors and more judgements to organise.

Collection specific learning

It appears that music similarity estimators can be roughly divided into three groups, based on the types of data that they leverage: purely content-based, augmented with behavioural data (such as skipping behaviour, playlist co-occurrence etc.), and trained (content-based with collection specific parameters estimated from a labelled subset of the database or an independent database). Purely content-based estimators appear to be the default mode for this evaluation. However, evaluation of trained submissions should be possible. Assuming an evaluation database size of 4000+ examples, 1000 - 1500 examples could be held back for training. Examples should be selected from the database in the same proportions that they occur. It might also be interesting to evaluate two copies of trained algorithms, one trained on the subset from the database and another trained on a separate dataset (greater differences in performance based on the two training sets may indicate overfitting to the collection, while smaller differences may indicate better generalisation).

Fabio Vignoli

Task related similarity

The focus on context in the previous paragraph referred to the other songs available in the collections. There is, however, another type of context that should be taken into account when comparing music similarity measures, and it has been mixed up with the first: the application context. What is the purpose of a music similarity algorithm? I think one of the following:

Browsing a music collection that is known to the user
Exploring a music collection that is unknown to the user
Creating playlists

The 3rd point has a completely different flavour from the other two. --Vignoli 08:28, 20 December 2005 (CST)

Fabian Mörchen

I consider collection specific learning very important. The more recent discussion on this list on using several ground truths also seems to support this. I doubt that someone can find a description of musical audio content that solves all the described problems on all types of music. A good system for audio similarity should be able to learn a ground truth given genre or artist or user preference or something else as ground truth and be able to approximately reproduce it on similar data, as long as the ground truth is music related. ;)

Kris West

A better definition of context?

Fabio makes an excellent point above: context is poorly defined for the discussion of music query-by-example systems. It seems to me that a much clearer division can be drawn between a specific query and the context within which it is performed. The context of a query may be composed of one or more of the following:

A User (the user's musical knowledge, tastes or behaviour)
A specific collection (How is music organised/categorised/clustered within the collection? Catalog owner's musical knowledge/behaviour, closely related to above)
Culture or Genre (Musical knowledge/conventions within the culture or genre, e.g. Western music, Western Classical music, Rock music, Greek traditional music. No sense making a Pop music query within the context of Greek traditional music unless trying to emulate tastes of user almost exclusively interested in Greek traditional music)

and will often be fixed for a given retrieval system. If the end user's cultural context can be assumed (e.g. western music), you might call these "objective" similarity estimators.

I believe any other context-like restrictions on the query are part of the specific query and may be explicitly defined or implied. Explicitly defined queries (which appear to be confused with contexts more often than implicitly defined queries) include:

Find songs that are rhythmically similar to ...
Find songs that have female vocals in them
Find songs that are acoustically similar to (sound like) ...

whereas implicitly defined queries include:

Find songs similar to A (our basic query)
Find songs similar to A || B || C ... (basic OR query)
Find songs similar to A && B && C... (AND query, what is similar about A, B and C, what other songs are similar in this way?)
- Could also be formulated as: Find songs that are to A as B is to C (B & C define a relationship, i.e they may be rhythmically similar or have the same vocalist or instrumentation, this relationship is then used to find songs related to A in this way)
Find songs dis-similar to A, !A (NOT query, filtering?)
- Find songs similar to A and dis-similar to B, A && !B
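One possible reading of the OR, AND and NOT queries above, given only a pairwise distance matrix, is to combine the per-seed distances: the minimum for OR, the maximum for AND, and a difference for "similar to A, dis-similar to B". This is purely an illustrative assumption and not something agreed on the list; in particular the maximum is only a crude stand-in for "what is similar about A, B and C". The function names and toy data are hypothetical.

```python
def _rank(scores, exclude, k):
    """Indices with the k lowest scores, skipping the seed tracks."""
    order = [j for j in range(len(scores)) if j not in exclude]
    order.sort(key=lambda j: scores[j])
    return order[:k]


def rank_or(dist, seeds, k):
    """A || B || ...: close to ANY seed (minimum distance over seeds)."""
    scores = [min(dist[s][j] for s in seeds) for j in range(len(dist))]
    return _rank(scores, set(seeds), k)


def rank_and(dist, seeds, k):
    """A && B && ...: close to EVERY seed (maximum distance over seeds)."""
    scores = [max(dist[s][j] for s in seeds) for j in range(len(dist))]
    return _rank(scores, set(seeds), k)


def rank_and_not(dist, like, unlike, k):
    """A && !B: near `like` and far from `unlike` (distance difference)."""
    scores = [dist[like][j] - dist[unlike][j] for j in range(len(dist))]
    return _rank(scores, {like, unlike}, k)


# Toy data: five tracks placed on a line, distance = absolute difference.
positions = [0.0, 1.0, 5.0, 6.0, 3.0]
dist = [[abs(p - q) for q in positions] for p in positions]

print(rank_or(dist, [0, 3], 2))      # [1, 2]
print(rank_and(dist, [0, 3], 1))     # [4]
print(rank_and_not(dist, 0, 3, 1))   # [1]
```

In the toy data the OR query returns the nearest neighbour of each seed, while the AND query returns the track lying between the two seeds.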

What can we evaluate?

Evaluation of a system that works within the context of an individual user's tastes is problematic as any judgements about performance must be subjective. Systems working within the context of a collection can be objectively evaluated on their ability to reproduce the organisation of the collection (this approach could also be used to evaluate systems working within the context of a user, by asking the user to organise the collection and attempting to reproduce that organisation). This implies that collection specific learning (and therefore a training set) should be allowed. Systems intended to work within the context of a music culture (such as western music) can also be evaluated on their ability to reproduce the organisation of a collection or set of collections of music. In this case a training set would not be used, or a training set which is completely independent of the test set is used (i.e. from a different source).

Assuming the training set is ignored by systems intended to work within the context of western musical culture, we can directly compare results from collection specific systems and so-called "objective" systems on the same collection (although the distinction should be indicated in the results).

Types of evaluation

There have been a number of papers describing similarity evaluation, including those by Whitman, Berenzweig, Ellis and Logan. The methods used generally fall into the following buckets:

Subjective precision via user tests
Expert opinion (similar artist lists from music editors like All Music Guide)
Playlist Co-occurrence
User Collection Co-occurrence
objective statistics based upon album, artist and genre labels. (TopN, average distance)

For a standard, annual evaluation like MIREX, the first four types of evaluations seem problematic.

1 - Subjective precision - this data is very expensive to collect for a large music collection, and would likely be unreliable unless many users were evaluated.

2 - Expert opinion - experts usually rate the similarity of artists, not songs. The ratings are also not symmetric: Coldplay may sound like the Beatles, but no one ever says the Beatles sound like Coldplay. This data generally only exists for popular artists (i.e. not for artists typically found in 'free' collections of music).

3, 4 - Playlist co-occurrence, user collection co-occurrence - works for popular music, but coverage is usually insufficient for less popular music. Generally not suitable for our test collections (such as Magnatune or Epitonic), since this music is not listened to by enough people.

Factors to evaluate

feature extraction time
distance computation time
memory consumption during feature extraction
memory consumption during distance computation

objective statistics based upon genre (with artist filter), artist & album labels:

closest 1 (ratio of pieces in same genre/artist/album as query)
closest 5 (-"-)
closest 10 (-"-)
closest 20 (-"-)
clustering performance (e.g. ratio of the average intra-genre distance to the average inter-genre distance, see below)

Items 3 and 4 (memory consumption) are difficult to evaluate: (KW) These may be slightly problematic due to the very varied formats in which algorithms are submitted. I'll see what I can come up with. Comments from smart Linux/Unix/Mac bods on how to do this sort of measurement (perhaps even with manual entry of PID) would be welcomed. I haven't the faintest idea how to do it on Windows other than manually with the task manager.
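On Linux/Unix/Mac, one way to measure this without knowing anything about the submission's format is to launch it as a child process and ask the operating system for its resource usage afterwards. The sketch below uses Python's standard `resource` and `subprocess` modules; the function name `measure` is made up, and the command shown is a stand-in for a real submission. The `resource` module is not available on Windows, so this does not answer that part of the question.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run `command` and return (wall_seconds, cpu_seconds, peak_rss).

    peak_rss is the largest resident set size of any child process waited
    for so far: kilobytes on Linux, bytes on macOS.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu, after.ru_maxrss


# Stand-in for a submitted feature extractor or distance computation.
wall, cpu, peak = measure([sys.executable, "-c", "sum(range(10**6))"])
print(f"wall {wall:.2f} s, cpu {cpu:.2f} s, peak RSS {peak}")
```

Because the peak figure is a high-water mark over all children of the measuring process, each submission (and each of its two stages) would need to be measured from a fresh measuring process to get separate numbers.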

Objective statistics based upon album, artist and genre labels proposal

Use methods described in Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME2001), where the most similar 5, 10 or 20 songs are retrieved and the average number of songs in the same genre, from the same artist and from the same album are calculated.

An additional evaluation of the clustering performance of an algorithm could be calculated as the ratio of the average intra-genre distance (within-class scatter or cohesion) to the average inter-genre distance (between-class scatter or separation).
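As a sketch of this ratio, assuming a symmetric distance matrix given as a list of lists (only the upper triangle is read) and one genre label per track; the function names and toy data are illustrative only:

```python
def mean_distance(dist, labels, same):
    """Mean distance over unordered pairs whose labels match (same=True)
    or differ (same=False)."""
    total, count = 0.0, 0
    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            if (labels[i] == labels[j]) == same:
                total += dist[i][j]
                count += 1
    return total / count


def clustering_ratio(dist, genres):
    """Average intra-genre distance divided by average inter-genre distance."""
    return mean_distance(dist, genres, True) / mean_distance(dist, genres, False)


dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 5],
    [4, 4, 0, 1],
    [5, 5, 1, 0],
]
genres = ["rock", "rock", "jazz", "jazz"]

# Intra-genre mean is 1.0, inter-genre mean is 4.5.
print(round(clustering_ratio(dist, genres), 3))  # 0.222
```

Lower values indicate tighter genre clusters relative to their separation. A value near one would mean the measure does not separate the genres at all.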

Justification for using artist and album labels

At Sun Labs we've been auditioning a number of different similarity models. We've had some models that behaved in ways that are apparently similar to your 'spectrum histogram', in that they yield good objective scores (a high percentage of songs in the top 20 are of the proper genre, the average intra-genre distance is low compared to the overall average distance), but when using the models for actual playlist generation or visualization we'd experience similar 'space-time distortions'. 75% of the songs might be of the proper genre, but the other 25% would be way off. We call them 'clunkers': songs that no human would say belong in the playlist of similar songs. Also, we'd see other similar problems, where the songs in the 75% of the proper genre were not really very similar. A folk rock song would be 'near' a punk rock song, or a choral piece would be near a harpsichord piece. One way of reducing this problem would be to take into account the artist and album metadata in the evaluation. Presumably there are three enclosing clusters: at the coarsest level is a genre cluster, within this cluster we could expect to find multiple artist clusters, and within an artist cluster we may find multiple album clusters. Now, if it turns out that the similarity classifier being evaluated really makes no distinction of nearness within the genre (e.g. choral and harpsichord music can be 'near'), then the ratio of the average artist distance to average genre distance will approach one (the artist cluster is as large as the genre cluster), but if this ratio is much less than one then we have small artist clusters within the genre cluster: the harpsichord music has clustered together and separated from the choral music performed by a different artist. The album clustering can give another, finer-grained level (but I'm not convinced that this extra level is necessary).

To sum up then, the average artist distance compared to the average genre distance gives a notion of how well a similarity metric is working within a single genre.
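The choral/harpsichord example can be made concrete with a small sketch. It reads "average artist distance" as the mean distance between tracks by the same artist within a genre, and "average genre distance" as the mean distance between all tracks in the same genre; that reading, the function names and the toy numbers are all assumptions.

```python
def mean_pair_distance(dist, keep):
    """Mean of dist[i][j] over unordered pairs (i, j) accepted by `keep`."""
    pairs = [
        dist[i][j]
        for i in range(len(dist))
        for j in range(i + 1, len(dist))
        if keep(i, j)
    ]
    return sum(pairs) / len(pairs)


def artist_to_genre_ratio(dist, artists, genres):
    """Average same-artist distance over average same-genre distance."""

    def same_genre(i, j):
        return genres[i] == genres[j]

    def same_artist(i, j):
        return same_genre(i, j) and artists[i] == artists[j]

    return mean_pair_distance(dist, same_artist) / mean_pair_distance(dist, same_genre)


genres = ["classical"] * 4
artists = ["harpsichordist", "harpsichordist", "choir", "choir"]

# A measure that separates the two artists within the genre.
separating = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]
# A measure that treats everything in the genre as equally near.
flat = [
    [0, 2, 2, 2],
    [2, 0, 2, 2],
    [2, 2, 0, 2],
    [2, 2, 2, 0],
]

print(round(artist_to_genre_ratio(separating, artists, genres), 3))  # 0.333
print(artist_to_genre_ratio(flat, artists, genres))                  # 1.0
```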

Volatility of objective statistics and clustering metric

It has been suggested that the mean is too volatile for calculating the above statistics, as it is influenced by outliers, and that we should therefore use the median. Outliers easily sneak into any ground truth or can be caused by choosing a bad segment of the song. However, if a participant selects a poor segment from a song, that reflects on the assumption that a representative sample can be selected at random from a piece of music (an intelligent segmentation or thumbnailing technique might have a significant advantage). If instead a poor query segment is selected for the evaluation, it will affect all submissions equally. If a particular submission handles a poor query better it should achieve a better evaluation score, rather than have that additional performance averaged out. I.e. if an algorithm doesn't produce as many outliers, that fact should be represented in the evaluation score.

The use of the median average will most likely improve all calculated ratios, but will reduce the difference between algorithms. A more useful alternative may be the trimmed mean (remove 1 - 2% of results from both ends of each distribution then calculate mean). It has also been commented that, in generated playlists, outliers can ruin the perception of the performance of the rest of the generated playlist and therefore the use of a metric which deliberately ignores outliers is highly inappropriate (this applies to both the median and trimmed mean).

Another interesting statistic may be the difference between the mean and median statistics, as a lower value should indicate that the algorithm handles outliers well (is less volatile), while a higher value will indicate more outliers in the distribution.
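The three averages discussed above, and the mean/median gap, can be compared on a small made-up example. The sketch uses Python's standard `statistics` module; `trimmed_mean` is written out by hand and its 2% default simply follows the "1 - 2%" suggested above. The data are invented: 98 queries with 16 of 20 results in the right genre and two outlier queries with none.

```python
from statistics import mean, median


def trimmed_mean(values, proportion=0.02):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    return mean(ordered[cut:len(ordered) - cut])


# Genre matches in the top 20 for 100 hypothetical queries.
matches = [16] * 98 + [0, 0]

print(mean(matches))                         # 15.68
print(median(matches))                       # 16.0
print(trimmed_mean(matches))                 # 16
print(abs(mean(matches) - median(matches)))  # roughly 0.32
```

Only the mean registers the two outlier queries; the median and the trimmed mean both report 16, which is the behaviour the playlist argument above objects to.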

Subjective evaluation

Here's another possibility for evaluation. I notice that there are now around 60 people subscribed to this email list. I imagine that this will grow to at least 100 or so by the time the MIREX 2006 evaluation begins. With this many people, perhaps we can do some real human evaluations of submitted systems. If we can get each person on this list to agree to evaluate 10 playlists and rate them on a scale of 1 to 5 in terms of how similar the songs are, then we can get 1000 playlist ratings. If 50 systems are submitted, that means we can have 20 different people look at each playlist. Here's what we could do:

Each system would be given a seed song and asked to generate a 'playlist' of the 10 most similar songs (all systems would receive the same seed song)
The lists would be anonymized (song and artists removed from titles to prevent bias from the metadata)
The playlists would be randomly distributed to the set of evaluators to be scored (a web interface would suffice).
Evaluators assign a 1 to 5 rating to each of their ten assigned playlists
The average score for each playlist would be recorded as part of the evaluation
The resulting playlists can be made available for browsing after the evaluation results are published.
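The random distribution step could be done so that every playlist receives the same number of ratings and no evaluator sees a playlist twice. The sketch below is one simple way to do that under the figures above (50 playlists, 100 evaluators, 10 playlists each); the function name and the requirement that the counts divide evenly are assumptions of this sketch.

```python
import random
from collections import Counter


def assign_playlists(playlist_ids, n_evaluators, per_evaluator, seed=0):
    """Deal playlists to evaluators in balanced, independently shuffled rounds.

    Each round is one shuffled copy of all playlists. Every evaluator's share
    is cut from a single round, so it never contains a repeat.
    """
    n = len(playlist_ids)
    slots = n_evaluators * per_evaluator
    if n % per_evaluator or slots % n:
        raise ValueError("counts must divide evenly for a balanced assignment")
    rng = random.Random(seed)
    sequence = []
    for _ in range(slots // n):
        one_round = list(playlist_ids)
        rng.shuffle(one_round)
        sequence.extend(one_round)
    return [
        sequence[e * per_evaluator:(e + 1) * per_evaluator]
        for e in range(n_evaluators)
    ]


assignment = assign_playlists(list(range(50)), n_evaluators=100, per_evaluator=10)
ratings_per_playlist = Counter(p for share in assignment for p in share)

print(len(assignment))                     # 100
print(set(ratings_per_playlist.values()))  # {20}
```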

Since music similarity is such a human quantity I think that it is important to have some user evaluation of similarity as part of MIREX. Thoughts?

Alternatives to rating on a scale

A pairwise comparison approach ("Is playlist A better or worse than playlist B?") might be more appropriate than rating on a scale (where, for example, individuals might prefer or avoid the middle of the scale by nature, or where all ten of someone's playlists could be in reality very bad compared to the set of all generated playlists, but he or she would have no basis to judge them as such). A one-bit measurement (relevant or irrelevant) has really paid off in text retrieval. I expect it would pay off in music retrieval as well.

A one bit measure of overall similarity may be tricky to come by. But there are plenty of one bit musical features that could be measured quickly and pretty consistently, and tracks with a sufficient proportion of these in common can reasonably be called similar. This is presumably how the folks at pandora.com manage to process a new track in 20 seconds. Could this approach work?
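A minimal sketch of that idea: describe each track by a vector of yes/no features and call two tracks similar when they agree on a sufficient proportion. The feature names, the example values and the 0.75 threshold are all invented for illustration, and nothing here is claimed about how pandora.com actually works.

```python
# Hypothetical one-bit features, in a fixed order.
FEATURES = ("female_vocals", "acoustic_guitar", "four_on_the_floor", "major_key")


def shared_feature_fraction(a, b):
    """Fraction of binary features on which two tracks agree."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_similar(a, b, threshold=0.75):
    """One-bit similarity: do the tracks agree on enough features?"""
    return shared_feature_fraction(a, b) >= threshold


track_a = (1, 0, 1, 1)
track_b = (1, 0, 0, 1)
track_c = (0, 1, 0, 1)

print(shared_feature_fraction(track_a, track_b))  # 0.75
print(is_similar(track_a, track_b))               # True
print(is_similar(track_a, track_c))               # False
```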

Related Papers

Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis.
One of the first papers on this topic. Reports a small scale listening test (2 users) in which items in a playlist are rated as similar or not similar to the query song. In addition, automatic evaluation is reported: percentage of top 5, 10, 20 most similar songs in the same genre/artist/album as query. A must read!
Aucouturier and Pachet (ISMIR 2002), Music Similarity Measures : What's the use?.
Similar in some ways to the work of Logan and Salomon. Evaluation includes percentage of retrieved songs in the same genre (for top 1, 5, 10, 20, 100 songs) and some cluster (genre) overlap measures. Excellent paper!
Ellis, Whitman, Berenzweig, and Lawrence (ISMIR 2002), The Quest for Ground Truth in Music Similarity.
The MusicSeer survey is reported. (MusicSeer was a very clever way to get lots of users to rate artists by similarity.)
Berenzweig, Logan, Ellis, and Whitman (ISMIR 2003), A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures.
Artist similarity measures are evaluated based on data from All Music Guide, from a survey (musicseer.com), and from playlists and personal collections.
Logan, Ellis, and Berenzweig (SIGIR 2003), Toward Evaluation Techniques for Music Similarity.
Evaluating artist similarity (similar to the ISMIR 2003 version).
Pampalk, Dixon, and Widmer (DAFx 2003), On the Evaluation of Perceptual Similarity Measures for Music
An attempt was made to compare similarity measures published by different authors. Artist, album, tones, style, and genre (the last three from AMG) were used for the evaluations. The average distance between all songs vs the average distance within a group (e.g. genre) was used as the quality criterion.
Aucouturier and Pachet (JNRSAS 2004), Timbre Similarity: How high is the sky?.
Follow up to their ISMIR 2002 paper. Contains detailed results of experiments on the optimization of spectral similarity. Reports a glass ceiling. Excellent article!
Pampalk, Flexer, and Widmer (ISMIR 2005), Improvements of Audio-based Music Similarity and Genre Classification.
The need for an artist filter (ie, not having the same artists in the test and training set) is described in this paper.
Vignoli and Pauws (ISMIR 2005), A Music Retrieval System Based on User-Driven Similarity and its Evaluation.
User evaluation based on a playlist generation system (which partly uses audio-based similarity).

Opt-in survey of Audio music similarity researchers

In this section we would like to take a brief 'opt-in' survey of researchers actively working in this field. Please feel free to add yourself to the list (or email your details to the moderators listed above).

Kris West (University of East Anglia, UK) - homepage publications
Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - homepage publications
Paul Lamere (Sun Labs, Sun Microsystems) - Project overview
Rebecca Fiebrink (McGill University, Montreal) - homepage
