Difference between revisions of "2006:Audio Music Similarity and Retrieval"

Revision as of 09:43, 21 July 2006

Overview

This page presents the Audio Music Similarity Evaluation, including the submission rules and formats. It also provides background information that should help explain some of the reasoning behind the approach taken in the evaluation.

Introduction

Although the automatic extraction of genre and artist labels from audio are interesting tasks, they are often used to evaluate more general music similarity techniques that compare two songs based on their audio content. These techniques are hard to evaluate directly, for example with listening tests, as it is not practical to have a human listener rank the similarities of even a small test collection for a number of queries, which might require many hours of listening. As part of MIREX 2006 we will be directly evaluating music similarity techniques. Since music similarity is a human notion, we will rely primarily on human evaluation of generated similarity lists.

Evaluation Summary

We will be soliciting contributions to two distinct tracks: Audio Music Search and Cover Song Search.
The division between these two tracks will be emphasized in the evaluation results, although results will be directly comparable (all evaluations will be performed for both tracks).
The intention of the Audio Music Search track is to evaluate music similarity searches (a music search engine that takes a single song as a query), not playlist generation or music recommendation.
Any criteria may be used to implement the search, although socio-cultural context and lyrics are not really being considered.
Any models, codebooks etc. *must* be trained in advance. No collection specific optimisations should be used.
Please avoid use of the USPOP collection in your training (sorry if this is a bit harsh on you, Dan) as it will form part of the test database. Please also avoid any other overlap with the test data that you can identify.

The Evaluation Database

The specifications of the evaluation database will be as follows:

22.05kHz, mono, 16bit, WAV files
The WAV files will be decoded from 192kbit Variable-bit-rate, stereo, 44.1kHz, MP3 files, produced with the lame codec
Will contain ~5000 tracks
Selected from both the USPOP and USCRAP collections
No tracks shorter than 30 seconds
No tracks longer than 10 minutes
A maximum of 20 tracks per artist
A minimum of 50 tracks per labelled genre
Will contain the ~350 songs from the IMIRSEL cover songs collections (30 distinct pieces - ~10-12 versions of each)
Cover songs, USPOP and USCRAP files will all be handled in exactly the same way (archival quality copy > 192k VBR MP3 > 22kHz WAV)
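
The per-file part of the specification above (sample rate, channels, bit depth, duration limits) can be checked mechanically. The following is a minimal sketch using only Python's standard wave module. The function name check_track and its return convention are assumptions made for illustration and are not part of the MIREX tooling.

```python
import wave

def check_track(path, min_seconds=30.0, max_seconds=600.0):
    """Return a list of problems; an empty list means the file matches the spec."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        if rate != 22050:
            problems.append("sample rate is %d Hz, expected 22050" % rate)
        if wav.getnchannels() != 1:
            problems.append("file is not mono")
        if wav.getsampwidth() != 2:
            problems.append("samples are not 16 bit")
        duration = wav.getnframes() / float(rate)
    if duration < min_seconds:
        problems.append("track is shorter than %g seconds" % min_seconds)
    if duration > max_seconds:
        problems.append("track is longer than %g seconds" % max_seconds)
    return problems
```

Collection-level constraints (tracks per artist, tracks per genre) need metadata and are not covered by this sketch.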

Evaluation Methodology

Three distinct evaluations will be performed:

Human Evaluation
Evaluation according to cover song matches
Objective statistics derived from the distance matrix

Human Evaluation

The primary evaluation will involve subjective judgments by human evaluators of the retrieved sets using Stephen Downie's Evalutron 6000.

Evaluator question: Given a search based on track A, which of these two tracks (B or C) is a better result? (Note: there is still some question as to whether using binary relative comparisons is a viable approach when the number of comparisons required is considered.)
~40 randomly selected queries, 5 results per query, 3 sets of eyes, ~10 participating labs
A higher number of queries is preferred, as IR research indicates that the variance is in the queries
The cover songs and songs by the same artist as the query will be filtered out of each result list to avoid colouring an evaluator's judgement (a cover song or song by the same artist in a result list is likely to reduce the relative ranking of other similar but independent songs, and use of songs by the same artist may allow over-fitting to affect the results)
These numbers can change as we are extracting the full distance matrices
It might be possible for researchers to use this data for other types of system comparisons after MIREX 2006 results have been finalized.
Human evaluation to be designed and led by IMIRSEL
Human evaluators will be drawn from the participating labs (and any volunteers from IMIRSEL or on the MIREX lists)
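
The filtering of result lists described above could look roughly like the sketch below. It assumes that "cover songs" means other versions of the same piece as the query, that the distance matrix is held as a list of rows, and that artist and cover-group labels are available per track. The function and parameter names are hypothetical.

```python
def filtered_results(query, distances, artist_of, cover_group_of, n_results=5):
    """Return the n_results nearest tracks to `query`, skipping the query itself,
    tracks by the same artist, and other versions of the same cover-song piece."""
    row = distances[query]
    ranked = sorted(range(len(row)), key=lambda j: row[j])
    group = cover_group_of[query]  # None when the query is not a cover-song track
    results = []
    for j in ranked:
        if j == query:
            continue
        if artist_of[j] == artist_of[query]:
            continue
        if group is not None and cover_group_of[j] == group:
            continue
        results.append(j)
        if len(results) == n_results:
            break
    return results
```

Because full distance matrices are collected, the list length (5 here) can be changed after submission without re-running any system.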

Evaluation According to Cover song matches

 ( need more detail here )

Objective Statistics derived from the distance matrix

Statistics of each distance matrix will be calculated including:

Average % of Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Precision at 5, 10, 20 & 50
Average % of Genre matches in the top 5, 10, 20 & 50 results after artist filtering of results
Average % of available Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Recall at 5, 10, 20 & 50 (just normalising scores when fewer than 20 matches for an artist, album or genre are available in the database)
Normalised average distance between examples of the same Genre, Artist or Album
Always similar - Maximum # times a file was in the top 5, 10, 20 & 50 results
% of files never similar (never in a top 5, 10, 20 & 50 result list)
% of song triplets where the triangle inequality holds
Plot of the "number of times similar" curve - plot of song number vs. number of times it appeared in a top 20 list, with songs sorted according to the number of times they appeared in a top 20 list (to produce the curve). Systems with a sharp rise at the end of this plot have "hubs", while a long 'zero' tail shows many never similar results.
Ratio of the average artist distance to the average genre distance
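
A few of the statistics listed above (precision at k with optional artist filtering, the "always similar" / "never similar" counts, and the triangle inequality check) can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: the distance matrix is a list of rows, labels are plain lists indexed by track, and all function names are hypothetical.

```python
from itertools import permutations

def top_k(dist, query, k, skip=None):
    """Indices of the k nearest tracks to `query`, never including the query.
    `skip` is an optional predicate, used here for artist filtering."""
    candidates = [j for j in range(len(dist))
                  if j != query and not (skip and skip(j))]
    candidates.sort(key=lambda j: dist[query][j])
    return candidates[:k]

def precision_at_k(dist, labels, k, artists=None):
    """Average fraction of returned results sharing the query's label.
    If `artists` is given, same-artist tracks are removed before ranking."""
    total = 0.0
    for q in range(len(dist)):
        skip = (lambda j, q=q: artists[j] == artists[q]) if artists else None
        hits = top_k(dist, q, k, skip)
        if hits:
            # Divide by the number actually returned, which may be below k.
            total += sum(labels[j] == labels[q] for j in hits) / float(len(hits))
    return total / len(dist)

def times_in_top_k(dist, k):
    """For each track, how many top-k lists it appears in."""
    counts = [0] * len(dist)
    for q in range(len(dist)):
        for j in top_k(dist, q, k):
            counts[j] += 1
    return counts

def triangle_inequality_rate(dist):
    """Fraction of ordered triplets (a, b, c) with d(a,c) <= d(a,b) + d(b,c)."""
    triplets = list(permutations(range(len(dist)), 3))
    ok = sum(dist[a][c] <= dist[a][b] + dist[b][c] for a, b, c in triplets)
    return ok / float(len(triplets))
```

With counts = times_in_top_k(dist, 20), max(counts) gives the "always similar" figure, the share of zeros gives the "never similar" percentage, and sorted(counts) is the curve to plot. The exhaustive triplet loop is only practical for small matrices; for a collection of ~5000 tracks a random sample of triplets would presumably be used instead.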

Additional Data Reported

Runtimes - Where possible accurate runtimes will be recorded. The two call format allows separation of feature extraction/indexing runtimes from the final query runtimes.

Submission Format

A submission to the Audio Music Similarity and Retrieval evaluation is expected to follow the Best Coding Practices for MIREX and must conform to the following for execution:

One Call Format

The one call format is appropriate for systems that perform all phases of the evaluation (typically feature extraction and evaluation) in one step. A submission should be an executable program that takes 3 arguments:

path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
path/to/cacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
path/to/output/DistMatrix - the file where the output distance matrix should be placed. The format is described below

Example:


doAudioSim "path/to/fileContainingListOfAudioFiles" "path/to/cacheDir" "path/to/output/DistMatrix"
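
Since the contents of cacheDir may be retained across runs, a submission can skip work it has already done if it is restarted. The sketch below shows one way to do this. The helper names, the JSON cache layout and the extract callback are all assumptions for illustration, and the real feature extraction is left to the submission.

```python
import hashlib
import json
import os

def read_file_list(list_path):
    """Read one audio path per line, ignoring blank lines."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]

def cached_features(audio_paths, cache_dir, extract):
    """Return {path: features}, calling `extract` only for uncached tracks.
    `extract` must return something JSON-serialisable, e.g. a list of floats."""
    os.makedirs(cache_dir, exist_ok=True)
    features = {}
    for path in audio_paths:
        key = hashlib.sha256(path.encode("utf-8")).hexdigest()
        cache_file = os.path.join(cache_dir, key + ".json")
        if os.path.exists(cache_file):
            with open(cache_file) as handle:
                features[path] = json.load(handle)
            continue
        features[path] = extract(path)
        # Write under a temporary name first so that an interrupted run
        # never leaves a half-written cache entry behind.
        tmp_file = cache_file + ".tmp"
        with open(tmp_file, "w") as handle:
            json.dump(features[path], handle)
        os.replace(tmp_file, cache_file)
    return features
```

A one call submission would read its three command-line arguments, call read_file_list on the first, cached_features with the second, and then compute and write the distance matrix to the third.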

Two Call Format

The two call format is appropriate for systems that break their processing into two phases (typically a feature extraction phase and an evaluation phase). The submission should consist of two programs:

doFeatureExtraction - this takes two arguments:
- path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
- path/to/cacheDir - a directory where the submission can place the output of the first stage
outputDistMatrix - this takes two arguments
- path/to/cacheDir - a directory where the first stage has placed its output.
- path/to/output/DistMatrix - the file where the output distance matrix should be placed. The format is described below.

Example:


doFeatureExtraction "path/to/fileContainingListOfAudioFiles" "path/to/cacheDir"
outputDistMatrix "path/to/cacheDir" "path/to/output/DistMatrix"

Matlab format

Matlab will also be supported in the form of functions in the following formats:

Matlab One call format

doMyMatlabAudioSim('path/to/fileContainingListOfAudioFiles','path/to/cacheDir','path/to/output/DistMatrix')

Matlab Two call format

doMyMatlabFeatureExtraction('path/to/fileContainingListOfAudioFiles','path/to/cacheDir')
doMyMatlabOutputDistMatrix('path/to/cacheDir','path/to/output/DistMatrix')

File Formats

Input File

The input list file format will be of the form:

path/to/audio/file/000001.wav
path/to/audio/file/000002.wav
path/to/audio/file/000003.wav
...
path/to/audio/file/00000N.wav

Output File

The only output will be a distance matrix file in the following format:

Example distance matrix 0.1 (replace this line with your system name)
1    path/to/audio/file/1.wav
2    path/to/audio/file/2.wav
3    path/to/audio/file/3.wav
...
N    path/to/audio/file/N.wav
Q/R    1        2        3        ...        N
1    0.0      1.241    0.2e-4     ...    0.4255934
2    1.241    0.000    0.6264     ...    0.2356447
3    50.2e-4  0.6264   0.0000     ...    0.3800000
...    ...    ...      ...        ...    0.7172300
N    0.42559  0.23567  0.38       ...    0.000

All distances should be zero or positive (0.0+) and should not be infinite or NaN. Values should be separated by 1 or more characters of whitespace.
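
A minimal sketch of a writer for this format is given below. It rejects negative, infinite and NaN distances before writing anything. The function name, the tab separator and the "%.6g" number format are choices made for this illustration; the format itself only requires one or more whitespace characters between values.

```python
import math

def write_distance_matrix(out_path, system_name, audio_paths, dist):
    """Write an N x N distance matrix in the layout shown above."""
    n = len(audio_paths)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("distance matrix must be N x N")
    for row in dist:
        for value in row:
            if math.isnan(value) or math.isinf(value) or value < 0.0:
                raise ValueError("distances must be finite and >= 0, got %r" % value)
    with open(out_path, "w") as out:
        # Line 1: free-form system name.
        out.write(system_name + "\n")
        # Index lines: 1-based track number, then the audio path.
        for index, path in enumerate(audio_paths, start=1):
            out.write("%d\t%s\n" % (index, path))
        # Header row, then one row of distances per query track.
        out.write("Q/R\t" + "\t".join(str(i) for i in range(1, n + 1)) + "\n")
        for index, row in enumerate(dist, start=1):
            out.write("%d\t%s\n" % (index, "\t".join("%.6g" % v for v in row)))
```

Validating before opening the output file means a bad matrix never produces a partially written result.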

Evaluation Background

Numerous discussions have taken place on the AudioSim mailing list. Some of the threads are summarized here: Important threads for Audio Similarity

Related Papers

Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis.
One of the first papers on this topic. Reports a small scale listening test (2 users) in which users rate items in a playlist as similar or not similar to the query song. In addition automatic evaluation is reported: percentage of top 5, 10, 20 most similar songs in the same genre/artist/album as query. A must read!
Aucouturier and Pachet (ISMIR 2002), Music Similarity Measures: What's the use?.
Similar in some ways to the work of Logan and Salomon. Evaluation includes percentage of retrieved songs in the same genre (for top 1, 5, 10, 20, 100 songs) and some cluster (genre) overlap measures. Excellent paper!
Ellis, Whitman, Berenzweig, and Lawrence (ISMIR 2002), The Quest for Ground Truth in Music Similarity.
The MusicSeer survey is reported. (MusicSeer was a very clever way to get lots of users to rate artists by similarity.)
Berenzweig, Logan, Ellis, and Whitman (ISMIR 2003), A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures.
Artist similarity measures are evaluated based on data from All Music Guide, from a survey (musicseer.com), and from playlists and personal collections.
Logan, Ellis, and Berenzweig (SIGIR 2003), Toward Evaluation Techniques for Music Similarity.
Evaluating artist similarity (similar to the ISMIR 2003 version).
Pampalk, Dixon, and Widmer (DAFx 2003), On the Evaluation of Perceptual Similarity Measures for Music
An attempt was made to compare similarity measures published by different authors. Artist, album, tones, style, and genre (the last three from AMG) were used for the evaluations. The average distance between all songs vs the average distance within a group (e.g. genre) was used as the quality criterion.
Aucouturier and Pachet (JNRSAS 2004), Timbre Similarity: How high is the sky?.
Follow up to their ISMIR 2002 paper. Contains detailed results of experiments on the optimization of spectral similarity. Reports a glass ceiling. Excellent article!
Pampalk, Flexer, and Widmer (ISMIR 2005), Improvements of Audio-based Music Similarity and Genre Classification.
The need for an artist filter (i.e., not having the same artists in the test and training set) is described in this paper.
Vignoli and Pauws (ISMIR 2005), A Music Retrieval System Based on User-Driven Similarity and its Evaluation.
User evaluation based on a playlist generation system (which partly uses audio-based similarity).

Opt-in survey of Audio music similarity researchers

In this section we would like to take a brief 'opt-in' survey of researchers actively working in this field. Please feel free to add yourself to the list (or email your details to the moderators listed below).

Kris West (University of East Anglia, UK) - homepage publications
Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - homepage publications
Paul Lamere (Sun Labs, Sun Microsystems) - Project overview
Rebecca Fiebrink (McGill University, Montreal) - homepage

Moderators

Kris West (University of East Anglia, UK) - kw@cmp.uea.ac.uk
Elias Pampalk (Austrian Research Institute for Artificial Intelligence (OFAI)) - elias.pampalk@gmail.com
Paul Lamere (Sun Microsystems Laboratories, USA) - paul.lamere@sun.com

@@ Line 3: / Line 3: @@
 == Introduction ==
-Although the automatic extraction of genre and artist labels from audio are interesting tasks, I (KW) believe that they are often used to evaluate more general music similarity techniques that compare two songs based on their audio content.  These techniques are hard to evaluate directly, for example with listening tests, as it is not practical to have a human listener rank the similarities of even a small test collection for a number of queries, which might require many hours of listening. Therefore, We have begun discussion of other methods of evaluating music similarity techniques, such as the methods described in [http://gatekeeper.research.compaq.com/pub/compaq/CRL/publications/logan/icme2001_logan.pdf Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME2001)], where the most similar 5, 10 or 20 songs were retrieved and the average number of songs in the same genre, from the same artist and from the same album calculated and more practical methods of subjective evaluation of similarity estimators (i.e. evaluation of performance, rather than comparison of output to that of human annotators). This evaluation could be extended to multiple genres if data is available. I believe it is also important that we evaluate other characteristics of these algorithms, such as the descriptor extraction time, query time and memory footprint (which may indicate the applicability of a technique to an application).
+Although the automatic extraction of genre and artist labels from audio are interesting tasks, they are often used to evaluate more general music similarity techniques that compare two songs based on their audio content.  These techniques are hard to evaluate directly, for example with listening tests, as it is not practical to have a human listener rank the similarities of even a small test collection for a number of queries, which might require many hours of listening. Therefore, as part of MIREX 2006 we will be directly evaluating music similarity techniques.  Since music similarity is a human notion, we will be relying primariliy on human evaluation of generated similarity lists as the primary method of evaluating music similarity.
-This page serves as a summary of the discussions held on the AudioSim06 mailing list and will eventually hold a final evaluation proposal for MIREX 2006.
 == Evaluation Summary ==