2006:Audio Music Similarity and Retrieval

Introduction

As the size of digital music collections grows, music similarity plays an increasingly important role as an aid to music discovery. A music similarity system can help a music consumer find new music by finding the music that is most musically similar to specific query songs (or is nearest to songs that the consumer already likes). However, evaluating music similarity systems is inherently difficult since music similarity is a subjective characteristic. Some attempts have been made to evaluate similarity systems with some degree of success (see the Related Papers section). In general, due to the lack of ground truth for music similarity, other evaluations such as genre and artist identification have been used as surrogates for music similarity.

This year, for the first time, we are attempting a large-scale music similarity evaluation. This evaluation will rely primarily on human judgement to rank the various submissions. We will also collect various objective measures related to genre and artist clustering, with the hope that we will find some correlation between these objective measures and the human evaluations.

This page presents the Audio Music Similarity Evaluation, including the submission rules and formats. Additional background information that should help explain some of the reasoning behind the approach taken in the evaluation can also be found here.

Evaluation Summary:

We will be soliciting contributions to two distinct tracks: Audio Music Search and Audio Cover Song.
The division between these two tracks will be emphasized in the evaluation results, although results will be directly comparable (all evaluations will be performed for both tracks).
The intention of the Audio Music Search track is to evaluate music similarity searches (a music search engine that takes a single song as a query), not playlist generation or music recommendation.
Any criteria may be used to implement the search, although we are not really considering socio-cultural context or lyrics.
Any models, codebooks etc. must be trained in advance. No collection specific optimisations should be used.
Please avoid use of the USPOP and USCRAP collections in your training as they will form the test database. Please also avoid any other overlap with the test data that you can identify.

Outstanding Issues

There are still a few issues that need to be resolved before this task is finalized:

Binary relative judgements vs. absolute judgements - we've had much debate over this. Most researchers seem to prefer the binary relative approach, but that places an extreme burden on the evaluators.
We are awaiting the final description of the Evalutron.

The Evaluation Database

The specifications of the evaluation database will be as follows:

22.05kHz, mono, 16bit, WAV files
The WAV files will be decoded from 192kbit Variable-bit-rate, stereo, 44.1kHz, MP3 files, produced with the lame codec
Will contain ~5000 tracks
Selected from both the USPOP and USCRAP collections
No tracks shorter than 30 seconds
No tracks longer than 10 minutes
A maximum of 20 tracks per artist
A minimum of 50 tracks per labelled genre
Will contain the ~350 songs from the IMIRSEL cover songs collections (30 distinct pieces - ~10-12 versions of each)
Cover songs, USPOP and USCRAP files will all be handled in exactly the same way (archival quality copy > 192k VBR MP3 > 22kHz WAV)

Evaluation Methodology

Three distinct evaluations will be performed:

Human Evaluation
Evaluation according to cover song matches
Objective statistics derived from the distance matrix

Human Evaluation

The primary evaluation will involve subjective judgments by human evaluators of the retrieved sets using Stephen Downie's Evalutron 6000 (Final description of the Evalutron is pending).

Evaluator question: Given a search based on track A, which of these two tracks (B or C) is a better result? (Note: there is still some question as to whether binary relative comparisons are a viable approach when the number of comparisons required is considered.)
~40 randomly selected queries, 5 results per query, 3 sets of eyes, ~10 participating labs
A higher number of queries is preferred, as IR research indicates that the variance is in the queries
The cover songs and songs by the same artist as the query will be filtered out of each result list to avoid colouring an evaluator's judgement (a cover song or song by the same artist in a result list is likely to reduce the relative ranking of other similar but independent songs - use of songs by the same artist may allow over-fitting to affect the results)
These numbers can change as we are extracting the full distance matrices
It might be possible for researchers to use this data for other types of system comparisons after MIREX 2006 results have been finalized.
Human evaluation to be designed and led by IMIRSEL
Human evaluators will be drawn from the participating labs (and any volunteers from IMIRSEL or on the MIREX lists)
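
The filtering of result lists described above can be sketched as follows. This is a minimal illustration, not the Evalutron's actual implementation: the function name `filtered_top_results`, the per-track artist labels and the set of cover-song indices are all assumptions made for the example, and the toy data is made up.

```python
def filtered_top_results(query, distances, artists, cover_ids, n_results=5):
    """Return the indices of the n_results nearest tracks to `query`,
    skipping the query itself, cover-song tracks and tracks that share
    the query's artist (hypothetical helper, for illustration only)."""
    candidates = [
        i for i in range(len(distances))
        if i != query
        and i not in cover_ids
        and artists[i] != artists[query]
    ]
    # Smaller distance means more similar, so sort in ascending order.
    candidates.sort(key=lambda i: distances[i])
    return candidates[:n_results]


# Toy data: distances from track 0 to six tracks (invented values).
distances = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
artists = ["A", "A", "B", "C", "B", "D"]
cover_ids = {2}  # track 2 belongs to the cover song collection
print(filtered_top_results(0, distances, artists, cover_ids, n_results=2))
```

In this toy case track 1 is dropped because it shares the query's artist and track 2 is dropped because it is a cover song, so the list is filled from the remaining tracks.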

Evaluation According to Cover song matches

See the page Audio Cover Song for information on how the Audio Cover Song task will be evaluated.

Objective Statistics derived from the distance matrix

Statistics of each distance matrix will be calculated including:

Average % of Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Precision at 5, 10, 20 & 50
Average % of Genre matches in the top 5, 10, 20 & 50 results after artist filtering of results
Average % of available Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Recall at 5, 10, 20 & 50 (just normalising scores when fewer than 20 matches for an artist, album or genre are available in the database)
Normalised average distance between examples of the same Genre, Artist or Album
Always similar - Maximum # times a file was in the top 5, 10, 20 & 50 results
% File never similar (never in a top 5, 10, 20 & 50 result list)
% of song triplets where the triangle inequality holds
Plot of the "number of times similar curve" - plot of song number vs. number of times it appeared in a top 20 list, with songs sorted according to the number of times they appeared in a top 20 list (to produce the curve). Systems with a sharp rise at the end of this plot have "hubs", while a long 'zero' tail shows many never similar results.
Ratio of the average artist distance to the average genre distance
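
A few of the statistics listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the evaluation code that will be used: the function names are hypothetical, the matrix is a list of lists where a smaller value means more similar, precision is divided by k even when filtering leaves fewer than k results, and the toy matrix and genre labels are invented.

```python
from itertools import combinations


def top_k(dist, q, k):
    """Indices of the k nearest tracks to query q, excluding q itself."""
    others = [i for i in range(len(dist)) if i != q]
    others.sort(key=lambda i: dist[q][i])
    return others[:k]


def precision_at_k(dist, labels, k, artists=None):
    """Average fraction of the top-k results sharing the query's label.
    If `artists` is given, results by the query's artist are removed
    before the top k are taken (artist filtering)."""
    total = 0.0
    for q in range(len(dist)):
        ranked = top_k(dist, q, len(dist))
        if artists is not None:
            ranked = [i for i in ranked if artists[i] != artists[q]]
        total += sum(labels[i] == labels[q] for i in ranked[:k]) / k
    return total / len(dist)


def times_in_top_k(dist, k):
    """For each track, count how many queries list it in their top k.
    The maximum is the 'always similar' figure; zeros are 'never similar'."""
    counts = [0] * len(dist)
    for q in range(len(dist)):
        for i in top_k(dist, q, k):
            counts[i] += 1
    return counts


def triangle_inequality_rate(dist):
    """Fraction of song triplets for which all three triangle
    inequalities hold."""
    triplets = list(combinations(range(len(dist)), 3))
    ok = 0
    for a, b, c in triplets:
        if (dist[a][c] <= dist[a][b] + dist[b][c]
                and dist[a][b] <= dist[a][c] + dist[c][b]
                and dist[b][c] <= dist[b][a] + dist[a][c]):
            ok += 1
    return ok / len(triplets)


# Toy 4-track example (invented values).
dist = [
    [0, 1, 2, 5],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [5, 2, 1, 0],
]
genres = ["rock", "rock", "jazz", "jazz"]
print(precision_at_k(dist, genres, 1))
print(times_in_top_k(dist, 1))
print(triangle_inequality_rate(dist))
```

Sorting the output of `times_in_top_k` gives the data for the "number of times similar" curve: a large maximum points to a hub, and a run of zeros corresponds to the never similar tail.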

Additional Data Reported

Runtimes - Where possible accurate runtimes will be recorded. The two call format allows separation of feature extraction/indexing runtimes from the final query runtimes.

Submission Format

A submission to the Audio Music Similarity and Retrieval evaluation is expected to follow the Best Coding Practices for MIREX and must conform to the following for execution:

One Call Format

The one call format is appropriate for systems that perform all phases of the evaluation (typically feature extraction and evaluation) in one step. A submission should be an executable program that takes 3 arguments:

path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
path/to/cacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
path/to/output/DistMatrix - the file where the output distance matrix should be placed. The format is described below

Example:


doAudioSim "path/to/fileContainingListOfAudioFiles" "path/to/cacheDir" "path/to/output/DistMatrix"
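
As an illustration of how a submission might use the three arguments, and in particular reuse the cache directory after a restart, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names, the `.feat` cache file naming and the placeholder feature extractor are hypothetical, and the distance computation and output step are left out.

```python
import os
import sys


def read_file_list(list_path):
    """Read the input list: one audio file path per line, blank lines ignored."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def extract_features(audio_path):
    """Placeholder feature extractor (hypothetical). A real system would
    decode and analyse the WAV file; this returns a one-element vector
    so that the sketch runs without any audio data."""
    return [float(len(os.path.basename(audio_path)))]


def cached_features(audio_path, index, cache_dir):
    """Load features saved by an earlier run if present, else compute and save."""
    cache_file = os.path.join(cache_dir, "%06d.feat" % index)
    if os.path.exists(cache_file):
        with open(cache_file) as handle:
            return [float(value) for value in handle.read().split()]
    features = extract_features(audio_path)
    with open(cache_file, "w") as handle:
        handle.write(" ".join(repr(value) for value in features))
    return features


def main(argv):
    list_path, cache_dir, out_path = argv[1:4]
    os.makedirs(cache_dir, exist_ok=True)
    files = read_file_list(list_path)
    features = [cached_features(path, index, cache_dir)
                for index, path in enumerate(files, start=1)]
    # Computing distances and writing them to out_path is omitted here.
    return files, features


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

If the program is restarted, files whose features are already in the cache directory are not processed again.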

Two Call Format

The two call format is appropriate for systems that break their processing into two phases (typically a feature extraction phase and an evaluation phase). The submission should consist of two programs:

doFeatureExtraction - this takes two arguments:
- path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
- path/to/cacheDir - a directory where the submission can place the output of the first stage
outputDistMatrix - this takes two arguments:
- path/to/cacheDir - a directory where the first stage has placed its output.
- path/to/output/DistMatrix - the file where the output distance matrix should be placed. The format is described below.

Example:


doFeatureExtraction "path/to/fileContainingListOfAudioFiles" "path/to/cacheDir"
outputDistMatrix "path/to/cacheDir" "path/to/output/DistMatrix"

Matlab format

Matlab will also be supported in the form of functions in the following formats:

Matlab One call format

doMyMatlabAudioSim('path/to/fileContainingListOfAudioFiles','path/to/cacheDir','path/to/output/DistMatrix')

Matlab Two call format

doMyMatlabFeatureExtraction('path/to/fileContainingListOfAudioFiles','path/to/cacheDir')
doMyMatlabOutputDistMatrix('path/to/cacheDir','path/to/output/DistMatrix')

File Formats

Input File

The input list file format will be of the form:

path/to/audio/file/000001.wav
path/to/audio/file/000002.wav
path/to/audio/file/000003.wav
...
path/to/audio/file/00000N.wav

Output File

The only output will be a distance matrix file in the following format:

Example distance matrix 0.1 (replace this line with your system name)
1    path/to/audio/file/1.wav
2    path/to/audio/file/2.wav
3    path/to/audio/file/3.wav
...
N    path/to/audio/file/N.wav
Q/R    1        2        3        ...        N
1    0.0      1.241    0.2e-4     ...    0.4255934
2    1.241    0.000    0.6264     ...    0.2356447
3    50.2e-4  0.6264   0.0000     ...    0.3800000
...    ...    ...      ...        ...    0.7172300
N    0.42559  0.23567  0.38       ...    0.000

All distances should be zero or positive (0.0+) and should not be infinite or NaN. Values should be separated by 1 or more characters of whitespace.
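
A writer for this layout could look like the following minimal sketch. It is an illustration rather than a reference implementation: the function name `write_distance_matrix` is hypothetical, and the choice of tabs as the whitespace separator and six decimal places is an assumption (the format only asks for one or more whitespace characters between values).

```python
import math


def write_distance_matrix(out_path, system_name, files, dist):
    """Write `dist` (N rows of N numbers) in the layout shown above.
    Raises ValueError if the matrix is not N x N or contains a
    negative, infinite or NaN distance."""
    n = len(files)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("distance matrix must be N x N")
    for row in dist:
        for value in row:
            if math.isnan(value) or math.isinf(value) or value < 0:
                raise ValueError("distances must be finite and >= 0")
    with open(out_path, "w") as handle:
        # Header line: the system name.
        handle.write(system_name + "\n")
        # Index-to-file mapping, numbered from 1.
        for index, path in enumerate(files, start=1):
            handle.write("%d\t%s\n" % (index, path))
        # Column header followed by one row per query.
        handle.write("Q/R\t" + "\t".join(str(i) for i in range(1, n + 1)) + "\n")
        for index, row in enumerate(dist, start=1):
            handle.write("%d\t%s\n" % (index, "\t".join("%.6f" % v for v in row)))
```

Checking the values before opening the file means that a failed run does not leave a partially written matrix behind.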

Packaging your Submission

Be sure to follow the Best Coding Practices for MIREX
Be sure to follow the MIREX Submission Instructions
Submit your system via the Submission page
In the README file that is included with your submission, answer the following questions:

1. Was the submission trained or tuned?
2. What was the source of the data used?
3. Are you aware of any overlap between the data and either the USCRAP, USPOP or Cover song collections?

Evaluation Background

Numerous discussions have taken place on the AudioSim mailing list. Some of the threads are summarized here: Important threads for Audio Similarity

Related Papers

Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis.
One of the first papers on this topic. Reports a small scale listening test (2 users) in which users rate items in a playlist as similar or not similar to the query song. In addition, an automatic evaluation is reported: percentage of top 5, 10, 20 most similar songs in the same genre/artist/album as the query. A must read!
Aucouturier and Pachet (ISMIR 2002), Music Similarity Measures: What's the use?.
Similar in some ways to the work of Logan and Salomon. Evaluation includes percentage of retrieved songs in the same genre (for top 1, 5, 10, 20, 100 songs) and some cluster (genre) overlap measures. Excellent paper!
Ellis, Whitman, Berenzweig, and Lawrence (ISMIR 2002), The Quest for Ground Truth in Music Similarity.
The MusicSeer survey is reported. (MusicSeer was a very clever way to get lots of users to rate artists by similarity.)
Berenzweig, Logan, Ellis, and Whitman (ISMIR 2003), A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures.
Artist similarity measures are evaluated based on data from All Music Guide, from a survey (musicseer.com), and from playlists and personal collections.
Logan, Ellis, and Berenzweig (SIGIR 2003), Toward Evaluation Techniques for Music Similarity.
Evaluating artist similarity (similar to the ISMIR 2003 version).
Pampalk, Dixon, and Widmer (DAFx 2003), On the Evaluation of Perceptual Similarity Measures for Music
An attempt was made to compare similarity measures published by different authors. Artist, album, tones, style, and genre (the last three from AMG) were used for the evaluations. The average distance between all songs vs the average distance within a group (e.g. genre) was used as quality criteria.
Aucouturier and Pachet (JNRSAS 2004), Timbre Similarity: How high is the sky?.
Follow up to their ISMIR 2002 paper. Contains detailed results of experiments on the optimization of spectral similarity. Reports a glass ceiling. Excellent article!
Pampalk, Flexer, and Widmer (ISMIR 2005), Improvements of Audio-based Music Similarity and Genre Classification.
The need for an artist filter (i.e., not having the same artists in the test and training set) is described in this paper.
Vignoli and Pauws (ISMIR 2005), A Music Retrieval System Based on User-Driven Similarity and its Evaluation.
User evaluation based on a playlist generation system (which partly uses audio-based similarity).

Opt-in survey of Audio music similarity researchers

In this section we would like to take a brief 'opt-in' survey of researchers actively working in this field. Please feel free to add yourself to the list (or email your details to the moderators listed below).

Kris West (University of East Anglia, UK) - homepage publications
Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - homepage publications
Paul Lamere (Sun Labs, Sun Microsystems) - Project overview
Rebecca Fiebrink (McGill University, Montreal) - homepage

Moderators

Kris West (University of East Anglia, UK) - kw@cmp.uea.ac.uk
Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - elias.pampalk@gmail.com
Paul Lamere (Sun Microsystems Laboratories, USA) - paul.lamere@sun.com
