2006:Audio Music Similarity and Retrieval Results

Revision as of 12:05, 10 October 2006

Introduction

These are the results for the 2006 running of the Audio Music Similarity and Retrieval task set. For background information about this task set please refer to the Audio Music Similarity and Retrieval page.

Each system was given 5000 songs chosen from the "uspop", "uscrap" and "cover song" collections. Each system then returned a 5000x5000 distance matrix. 60 songs were randomly selected as queries, and the 5 most highly ranked songs out of the 5000 were extracted for each query (after filtering out the query itself, results from the same artist and members of the cover song collection). For each query, the returned results from all participants were then grouped and evaluated by human graders, each query/candidate pair being evaluated by 3 different graders, each giving two scores, using the Evalutron 6000 system. Graders were asked to provide one categorical (broad) score with three categories, NS, SS and VS as explained below, and one fine score in the range from 0 to 10.
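
To make the selection step concrete, a minimal sketch of it in MATLAB is given below. This is an illustration only, not the actual MIREX evaluation code, and the variable names (D, artist, is_cover, queries) are assumptions.

    % Minimal sketch of the candidate-selection step (illustrative, not the
    % actual MIREX code). Assumed variables:
    %   D        - 5000x5000 distance matrix returned by a system
    %   artist   - 5000x1 cell array of artist labels
    %   is_cover - 5000x1 logical flags marking cover-song collection members
    %   queries  - indices of the 60 randomly selected query songs
    N = 5;                                            % length of each candidate list
    candidates = zeros(numel(queries), N);
    for q = 1:numel(queries)
        qi = queries(q);
        d = D(qi, :);
        % filter out the query itself, same-artist matches and cover-song tracks
        excluded = (1:size(D, 1)) == qi | strcmp(artist, artist{qi})' | is_cover';
        d(excluded) = Inf;
        [~, order] = sort(d, 'ascend');               % nearest remaining songs first
        candidates(q, :) = order(1:N);                % the 5 most highly ranked songs
    end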

Summary Data on Human Evaluations (Evalutron 6000)

Number of evaluators = 24
Number of evaluations per query/candidate pair = 3
Number of queries per grader = 7~8
Size of the candidate lists = Maximum 30 (with no overlap)
Number of randomly selected queries = 60

General Legend

Team ID

EP = Elias Pampalk
TP = Tim Pohle
VS = Vitor Soares
LR = Thomas Lidy and Andreas Rauber
KWT = Kris West (Trans)
KWL = Kris West (Likely)

Broad Categories

NS = Not Similar
SS = Somewhat Similar
VS = Very Similar

Calculating Summary Measures

Fine(1) = Sum of fine-grained human similarity decisions (0-10).
PSum(1) = Sum of human broad similarity decisions: NS=0, SS=1, VS=2.
WCsum(1) = 'World Cup' scoring: NS=0, SS=1, VS=3 (rewards Very Similar).
SDsum(1) = 'Stephen Downie' scoring: NS=0, SS=1, VS=4 (strongly rewards Very Similar).
Greater0(1) = NS=0, SS=1, VS=1 (binary relevance judgement).
Greater1(1) = NS=0, SS=0, VS=1 (binary relevance judgement using only Very Similar).

(1) Normalized to the range 0 to 1.
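
As an illustration, these summary measures can be computed for one system roughly as follows. This is a sketch only; the variable names broad and fine are assumptions, holding one system's broad judgements and 0-10 fine scores.

    % Sketch of the summary measures for one system (illustrative, not the
    % official evaluation script). Assumed variables:
    %   broad - cell array of broad judgements, each 'NS', 'SS' or 'VS'
    %   fine  - vector of the corresponding fine scores in the range 0-10
    n = numel(broad);
    Fine = sum(fine) / (10 * n);                      % fine scores normalised to 0-1

    % weighted broad score, normalised to 0-1 by the maximum weight
    score = @(w) (w(1) * sum(strcmp(broad, 'NS')) + ...
                  w(2) * sum(strcmp(broad, 'SS')) + ...
                  w(3) * sum(strcmp(broad, 'VS'))) / (max(w) * n);

    PSum     = score([0 1 2]);                        % NS=0, SS=1, VS=2
    WCsum    = score([0 1 3]);                        % 'World Cup' scoring
    SDsum    = score([0 1 4]);                        % 'Stephen Downie' scoring
    Greater0 = score([0 1 1]);                        % SS and VS both count as relevant
    Greater1 = score([0 0 1]);                        % only VS counts as relevant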

Overall Summary Results

(The results table mirex06_as_overalllist.csv is not available.)

http://staff.aist.go.jp/elias.pampalk/papers/mirex06/friedman.png

This figure shows the official ranking of the submissions computed using a Friedman test. The blue lines indicate significance boundaries at the p=0.05 level. As can be seen, the differences are not significant. For a more detailed description and discussion see [1].

Audio Music Similarity and Retrieval Runtime Data

(The runtime table as06_runtime.csv is not available.)

For a description of the computers the submissions ran on, see MIREX_2006_Equipment.

Friedman Test with Multiple Comparisons Results (p=0.05)

The Friedman test was run in MATLAB against the Fine summary data over the 60 queries.
Command: [c,m,h,gnames] = multcompare(stats, 'ctype', 'tukey-kramer', 'estimate', 'friedman', 'alpha', 0.05);
(The tables AV_sum_friedman.csv and AV_fine_result.csv are not available.)
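
The command above can be placed in context roughly as follows; fine_scores is an assumed name for the 60x6 matrix of per-query Fine summary scores (rows = queries, columns = systems).

    % Sketch of the ranking computation (illustrative variable names).
    [p, tbl, stats] = friedman(fine_scores, 1, 'off');   % ranks systems within each query
    [c, m, h, gnames] = multcompare(stats, 'ctype', 'tukey-kramer', ...
                                    'estimate', 'friedman', 'alpha', 0.05);
    % Each row of c identifies a pair of systems and gives a confidence interval
    % for the difference in their mean ranks; an interval containing zero means
    % the pair is not significantly different at the 0.05 level.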

Summary Results by Query

(The per-query results table mirex06_as_uberlist.csv is not available.)

Raw Scores

The raw data derived from the Evalutron 6000 human evaluations are located on the Audio Music Similarity and Retrieval Raw Data page.

Query Meta Data

(The query metadata table as06_queries.csv is not available.)

Results from Automatic Evaluation

(The automatic evaluation results table as06_nonhuman_results.csv is not available.)


Introduction to automatic evaluation

Automated evaluation of music similarity techniques based on a metadata catalogue has several advantages:

  • Does not require costly human 'graders'
  • Allows testing of incremental changes in indexing algorithms
  • Can achieve complete coverage of the test collection
  • Provides a target for machine-learning, feature-selection and optimisation experiments
  • Can predict the visualisation performance of an indexing technique
  • Can identify indexing 'anomalies' in the indices tested

Automated 'pseudo-objective' evaluation of music similarity estimation techniques was introduced by Logan & Salomon [1] and was shown to be highly correlated with careful human-based evaluations by Pampalk [2]. The results of this contest support the conclusions of Pampalk [2], although further work is required to fully understand the evaluation statistics.


Description of evaluation statistics

The evaluation statistics are listed below; a sketch of how some of them can be computed from a distance matrix follows the list.

Neighbourhood clustering (artist, genre, album): the average percentage of the top N results for each query that carry the same label.

Artist-filtered genre neighbourhood: the average percentage of the top N results for each query that belong to the same genre label, ignoring matches from the same artist (this ensures that results reflect musical rather than audio similarity).

Mean artist-filtered genre neighbourhood: a normalised form of the above statistic that weights each genre equally, penalising lop-sided performance.

Normalised average distance between examples: the average distance between examples carrying the same label; indicates the degree of clustering and the potential for visual organisation of a collection.

Always similar (hubs): the largest number of times any single example appears in the top N results for other queries; a result that appears too often will adversely affect performance without affecting the other statistics.

Never similar (orphans): the percentage of examples that never appear in a top N result list and therefore cannot be retrieved by search.

Triangular inequality (metric space): indicates whether the function produces a metric distance space and therefore which visualisation techniques may be applied to it.
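
The sketch below illustrates how the artist-filtered genre neighbourhood and the hub/orphan statistics could be computed; it is not the actual evaluation code, and the variable names (D, genre, artist) are assumptions for the NxN distance matrix and the corresponding label lists.

    % Sketch of the artist-filtered genre neighbourhood and hub/orphan statistics.
    topN = 5;                                     % length of the result lists
    n = size(D, 1);
    hits = zeros(n, 1);                           % same-genre matches per query
    appearances = zeros(n, 1);                    % times each track appears in a result list
    for q = 1:n
        d = D(q, :);
        d(q) = Inf;                               % never return the query itself
        d(strcmp(artist, artist{q})') = Inf;      % artist filter
        [~, order] = sort(d, 'ascend');
        results = order(1:topN);
        hits(q) = sum(strcmp(genre(results), genre{q}));
        appearances(results) = appearances(results) + 1;
    end
    artist_filtered_genre = 100 * mean(hits) / topN;      % average % of top N in the same genre
    always_similar        = max(appearances);             % 'hubs' statistic
    never_similar         = 100 * mean(appearances == 0); % % of 'orphans'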


Music-similarity evaluation issues

Care must be taken with all evaluations of audio music similarity estimation techniques, as there is great potential for over-fitting in these experiments and for producing over-optimistic estimates of a system's performance on novel test data.

The metadata catalogue used to conduct automated evaluations should be as accurate as possible. However, this technique seems relatively robust to a degree of noise in the catalogue, perhaps due to its coarse granularity.

Small test collections do not allow us to accurately predict performance on larger test collections. For example:

  • Indexing anomalies ('hubs' and 'orphans') cannot yet be understood.
    • A single 'hub' was found in the results of one system:
      • it appeared in nearly 2/5 of the result lists;
      • removing this one example from the collection of 5000 tracks makes it appear that the system does not suffer from indexing anomalies.
    • What will be the number and coverage of 'hubs' in a 100,000 song database?


Other Results from Automatic Evaluation

See Audio Music Similarity and Retrieval Other Automatic Evaluation Results page.


References

  1. Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis.
     One of the first papers on this topic. It reports a small-scale listening test (2 users) in which items in a playlist are rated as similar or not similar to the query song. An automatic evaluation is also reported: the percentage of the top 5, 10 and 20 most similar songs in the same genre/artist/album as the query.
  2. E. Pampalk, Computational Models of Music Similarity and their Application in Music Information Retrieval. PhD thesis, Vienna University of Technology, Austria, March 2006.