2006:Audio Music Similarity and Retrieval Results

Introduction

These are the results for the 2006 running of the Audio Music Similarity and Retrieval task set. For background information about this task set please refer to the 2006:Audio Music Similarity and Retrieval page.

Each system was given 5000 songs chosen from the "uspop", "uscrap" and "cover song" collections. Each system then returned a 5000x5000 distance matrix. 60 songs were randomly selected as queries, and the 5 most highly ranked songs out of the 5000 were extracted for each query (after filtering out the query itself, results from the same artist, and members of the cover song collection). Then, for each query, the returned results from all participants were grouped and evaluated by human graders, each query being evaluated by 3 different graders, each giving two scores (using the Evalutron 6000 system). Graders were asked to provide one categorical score with three categories (NS, SS and VS, as explained below) and one fine score (in the range from 0 to 10). An automated statistical evaluation based on a metadata catalog was also conducted. A description and analysis is provided below.
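
As an illustration of this candidate-list construction, the following is a minimal Python sketch; it assumes the distance matrix is held as a NumPy array and that per-song artist labels and cover-collection flags are available as parallel arrays. The names are hypothetical and this is not the actual MIREX evaluation code.

 import numpy as np

 def top_candidates(dist, query, artists, is_cover, n=5):
     # dist: full NxN distance matrix returned by a system
     # artists: artist label per song; is_cover: True for cover song collection members
     order = np.argsort(dist[query])              # candidate songs, nearest first
     keep = []
     for idx in order:
         if idx == query:                         # drop the query itself
             continue
         if artists[idx] == artists[query]:       # drop results by the same artist
             continue
         if is_cover[idx]:                        # drop cover song collection members
             continue
         keep.append(idx)
         if len(keep) == n:
             break
     return keep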

Summary Data on Human Evaluations (Evalutron 6000)

Number of evaluators = 24
Number of evaluations per query/candidate pair = 3
Number of queries per grader = 7~8
Size of the candidate lists = Maximum 30 (with no overlap)
Number of randomly selected queries = 60

General Legend

Team ID

EP = Elias Pampalk (abstract: https://www.music-ir.org/mirex/abstracts/2006/AS_pampalk.pdf)
TP = Tim Pohle (abstract: https://www.music-ir.org/mirex/abstracts/2006/AS_pohle.pdf)
VS = Vitor Soares
LR = Thomas Lidy and Andreas Rauber (abstract: https://www.music-ir.org/mirex/abstracts/2006/AS_lidy.pdf)
KWT = Kris West (Trans)
KWL = Kris West (Likely)

Broad Categories

NS = Not Similar
SS = Somewhat Similar
VS = Very Similar

Calculating Summary Measures

Fine (1) = Sum of fine-grained human similarity decisions (0-10).
PSum (1) = Sum of human broad similarity decisions: NS=0, SS=1, VS=2.
WCsum (1) = 'World Cup' scoring: NS=0, SS=1, VS=3 (rewards Very Similar).
SDsum (1) = 'Stephen Downie' scoring: NS=0, SS=1, VS=4 (strongly rewards Very Similar).
Greater0 (1) = NS=0, SS=1, VS=1 (binary relevance judgement).
Greater1 (1) = NS=0, SS=0, VS=1 (binary relevance judgement using only Very Similar).

(1) Normalized to the range 0 to 1.
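
For clarity, the mapping from grader judgements to these summary measures can be sketched as below. This is a minimal Python illustration assuming one broad category and one fine score per judgement; it is not the official scoring script.

 def summary_measures(judgements):
     # judgements: list of (broad, fine) pairs for one query's candidate list,
     # where broad is 'NS', 'SS' or 'VS' and fine is a 0-10 rating
     n = len(judgements)
     weights = {
         'PSum':     {'NS': 0, 'SS': 1, 'VS': 2},
         'WCsum':    {'NS': 0, 'SS': 1, 'VS': 3},
         'SDsum':    {'NS': 0, 'SS': 1, 'VS': 4},
         'Greater0': {'NS': 0, 'SS': 1, 'VS': 1},
         'Greater1': {'NS': 0, 'SS': 0, 'VS': 1},
     }
     scores = {'Fine': sum(f for _, f in judgements) / (10.0 * n)}
     for name, w in weights.items():
         scores[name] = sum(w[b] for b, _ in judgements) / (max(w.values()) * n)
     return scores   # each measure normalized to the range 0 to 1

For example, five judgements of ('SS', 5) each would give Fine = 0.5, PSum = 0.5, WCsum = 0.33, SDsum = 0.25, Greater0 = 1.0 and Greater1 = 0.0.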

Overall Summary Results

Measure    EP     TP     VS     LR     KWT    KWL
Fine       0.430  0.423  0.404  0.393  0.372  0.339
PSum       0.425  0.411  0.388  0.374  0.349  0.313
WCsum      0.358  0.340  0.323  0.306  0.280  0.248
SDsum      0.324  0.305  0.290  0.271  0.246  0.216
Greater0   0.627  0.623  0.586  0.579  0.557  0.509
Greater1   0.223  0.199  0.191  0.169  0.142  0.118


http://staff.aist.go.jp/elias.pampalk/papers/mirex06/friedman.png

This figure shows the official ranking of the submissions computed using a Friedman test. The blue lines indicate significance boundaries at the p=0.05 level. As can be seen, the differences are not significant. For a more detailed description and discussion see http://staff.aist.go.jp/elias.pampalk/papers/pam_mirex06.pdf.

Audio Music Similarity and Retrieval Runtime Data

Team ID  Stage     Machine  Run-time (seconds)
EP       feature   beer 6   5889
EP       distance  beer 6   6066
KWT      feature   beer 6   29899
KWT      distance  beer 6   25352
KWL      both      beer 4   47698
LR       feature   beer 4   13794
LR       distance  beer 4   131
TP       feature   beer 8   14333
TP       distance  beer 8   3337


For a description of the computers the submissions ran on see 2006:MIREX_2006_Equipment.

Friedman Test with Multiple Comparisons Results (p=0.05)

The Friedman test was run in MATLAB against the Fine summary data over the 60 queries.
Command: [c,m,h,gnames] = multcompare(stats, 'ctype', 'tukey-kramer','estimate', 'friedman', 'alpha', 0.05);
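
For readers without MATLAB, the omnibus Friedman statistic can be approximated in Python as sketched below. This assumes the per-query Fine scores are available as a 60x6 array (one column per system) in a hypothetical CSV file, and it does not reproduce MATLAB's multcompare post-hoc comparisons.

 import numpy as np
 from scipy.stats import friedmanchisquare

 # hypothetical file holding the 60x6 per-query Fine scores (EP, TP, VS, LR, KWT, KWL)
 fine = np.loadtxt('fine_by_query.csv', delimiter=',')
 stat, p = friedmanchisquare(*(fine[:, j] for j in range(fine.shape[1])))
 print('Friedman chi-square = %.4f, p = %.6g' % (stat, p))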

Friedman's ANOVA Table

Source    SS         df    MS        Chi-sq    Prob>Chi-sq
Columns   84.7333    5     16.9467   24.2905   0.00019091
Error     961.7667   295   3.2602
Total     1046.5     359


TeamID  TeamID  Lower bound  Mean  Upper bound  Significance
EP TP -0.963 0.008 0.980 FALSE
EP VS -0.755 0.217 1.188 FALSE
EP LR -0.630 0.342 1.313 FALSE
EP KWT -0.030 0.942 1.913 FALSE
EP KWL 0.320 1.292 2.263 TRUE
TP VS -0.763 0.208 1.180 FALSE
TP LR -0.638 0.333 1.305 FALSE
TP KWT -0.038 0.933 1.905 FALSE
TP KWL 0.312 1.283 2.255 TRUE
VS LR -0.847 0.125 1.097 FALSE
VS KWT -0.247 0.725 1.697 FALSE
VS KWL 0.103 1.075 2.047 TRUE
LR KWT -0.372 0.600 1.572 FALSE
LR KWL -0.022 0.950 1.922 FALSE
KWT KWL -0.622 0.350 1.322 FALSE


Summary Results by Query

Fine
queryID EP TP VS LR KWT KWL
a001528 0.428 0.318 0.387 0.459 0.354 0.331
a004667 0.429 0.503 0.579 0.529 0.383 0.429
a000518 0.107 0.217 0.221 0.213 0.145 0.206
a002693 0.657 0.495 0.337 0.519 0.345 0.483
a004830 0.338 0.348 0.345 0.311 0.413 0.354
a002784 0.371 0.432 0.280 0.347 0.281 0.282
a005705 0.590 0.739 0.500 0.389 0.337 0.247
a006272 0.258 0.233 0.219 0.149 0.247 0.200
a007005 0.188 0.078 0.166 0.190 0.183 0.085
a008401 0.365 0.242 0.319 0.176 0.514 0.247
a008850 0.111 0.464 0.211 0.211 0.123 0.113
a007054 0.230 0.315 0.354 0.295 0.295 0.343
a008365 0.260 0.344 0.337 0.280 0.374 0.145
b000990 0.477 0.401 0.403 0.366 0.487 0.335
b001799 0.301 0.437 0.277 0.464 0.327 0.303
b001516 0.367 0.579 0.471 0.358 0.329 0.358
b002576 0.138 0.228 0.167 0.213 0.445 0.167
b004483 0.479 0.434 0.266 0.285 0.477 0.264
b006517 0.599 0.739 0.149 0.709 0.615 0.176
b005395 0.374 0.402 0.342 0.511 0.503 0.319
b007493 0.809 0.677 0.736 0.429 0.577 0.785
b005447 0.447 0.342 0.553 0.254 0.426 0.419
b009401 0.639 0.513 0.695 0.612 0.557 0.464
b006979 0.228 0.219 0.442 0.314 0.158 0.224
b012801 0.316 0.215 0.278 0.443 0.419 0.527
b008611 0.337 0.329 0.277 0.299 0.247 0.227
b013992 0.266 0.309 0.334 0.331 0.219 0.207
b015082 0.497 0.530 0.494 0.527 0.280 0.500
b015991 0.818 0.617 0.784 0.787 0.305 0.385
b009364 0.393 0.412 0.535 0.443 0.489 0.384
a007915 0.656 0.500 0.690 0.373 0.248 0.397
a002856 0.309 0.164 0.173 0.167 0.420 0.307
a000751 0.653 0.475 0.353 0.305 0.171 0.244
a002907 0.487 0.509 0.376 0.156 0.269 0.315
a000193 0.344 0.425 0.296 0.305 0.199 0.321
b006599 0.353 0.300 0.363 0.483 0.201 0.245
b010953 0.631 0.564 0.747 0.731 0.497 0.502
a003397 0.325 0.361 0.223 0.315 0.161 0.167
a006525 0.448 0.421 0.192 0.231 0.397 0.155
b012279 0.592 0.306 0.564 0.305 0.537 0.565
a004526 0.497 0.401 0.421 0.461 0.590 0.485
b010504 0.727 0.594 0.631 0.615 0.377 0.428
b017426 0.393 0.423 0.396 0.424 0.449 0.467
b011185 0.611 0.455 0.524 0.547 0.353 0.465
b011453 0.475 0.317 0.292 0.410 0.500 0.535
b006618 0.480 0.534 0.584 0.499 0.526 0.559
b017223 0.059 0.202 0.179 0.202 0.452 0.467
a001530 0.615 0.593 0.337 0.625 0.550 0.312
b019063 0.445 0.403 0.383 0.433 0.398 0.162
b005063 0.587 0.711 0.334 0.235 0.507 0.271
a004035 0.495 0.557 0.530 0.516 0.299 0.292
a003713 0.425 0.409 0.259 0.321 0.467 0.427
b015200 0.198 0.236 0.109 0.198 0.140 0.174
a004755 0.556 0.584 0.699 0.493 0.481 0.437
b019276 0.310 0.447 0.360 0.393 0.210 0.347
b018901 0.711 0.743 0.796 0.717 0.602 0.446
b005570 0.334 0.363 0.424 0.331 0.268 0.298
b006144 0.513 0.565 0.671 0.600 0.537 0.421
b002169 0.274 0.426 0.423 0.344 0.307 0.386
b016133 0.487 0.305 0.441 0.415 0.330 0.226
Ave. Fine Score: 0.430 0.423 0.404 0.393 0.372 0.339
Psum
queryID EP TP VS LR KWT KWL
a001528 0.400 0.300 0.333 0.433 0.300 0.233
a004667 0.367 0.400 0.467 0.467 0.300 0.333
a000518 0.033 0.167 0.200 0.267 0.133 0.267
a002693 0.700 0.467 0.300 0.500 0.367 0.467
a004830 0.300 0.267 0.233 0.267 0.400 0.300
a002784 0.433 0.467 0.333 0.433 0.300 0.267
a005705 0.633 0.800 0.500 0.367 0.267 0.233
a006272 0.167 0.133 0.100 0.000 0.067 0.100
a007005 0.167 0.067 0.167 0.167 0.167 0.133
a008401 0.267 0.133 0.233 0.067 0.567 0.167
a008850 0.033 0.433 0.200 0.133 0.067 0.067
a007054 0.200 0.267 0.333 0.267 0.233 0.400
a008365 0.167 0.300 0.300 0.233 0.367 0.067
b000990 0.567 0.433 0.400 0.367 0.533 0.333
b001799 0.367 0.500 0.267 0.500 0.367 0.367
b001516 0.367 0.633 0.467 0.300 0.333 0.300
b002576 0.100 0.233 0.200 0.233 0.433 0.133
b004483 0.533 0.433 0.200 0.233 0.533 0.200
b006517 0.633 0.833 0.067 0.767 0.667 0.133
b005395 0.467 0.500 0.433 0.633 0.600 0.400
b007493 0.900 0.733 0.867 0.533 0.567 0.900
b005447 0.433 0.400 0.667 0.167 0.467 0.433
b009401 0.733 0.533 0.733 0.700 0.667 0.567
b006979 0.200 0.200 0.433 0.300 0.133 0.200
b012801 0.267 0.100 0.267 0.400 0.433 0.567
b008611 0.267 0.300 0.233 0.200 0.200 0.133
b013992 0.300 0.267 0.267 0.233 0.200 0.067
b015082 0.533 0.633 0.567 0.567 0.267 0.633
b015991 0.967 0.733 0.900 0.967 0.300 0.333
b009364 0.333 0.300 0.500 0.367 0.433 0.267
a007915 0.733 0.567 0.800 0.300 0.267 0.367
a002856 0.300 0.067 0.067 0.133 0.467 0.333
a000751 0.733 0.433 0.267 0.200 0.067 0.100
a002907 0.367 0.500 0.367 0.100 0.200 0.300
a000193 0.367 0.433 0.267 0.233 0.100 0.233
b006599 0.167 0.133 0.167 0.300 0.100 0.167
b010953 0.767 0.600 0.800 0.900 0.500 0.633
a003397 0.300 0.400 0.200 0.267 0.133 0.100
a006525 0.467 0.400 0.167 0.200 0.400 0.067
b012279 0.600 0.233 0.600 0.233 0.533 0.567
a004526 0.433 0.333 0.333 0.433 0.567 0.500
b010504 0.767 0.633 0.700 0.667 0.333 0.367
b017426 0.400 0.533 0.433 0.467 0.467 0.533
b011185 0.533 0.333 0.467 0.500 0.233 0.433
b011453 0.433 0.233 0.200 0.367 0.433 0.433
b006618 0.533 0.567 0.633 0.533 0.500 0.600
b017223 0.033 0.100 0.133 0.167 0.500 0.500
a001530 0.733 0.633 0.367 0.667 0.600 0.333
b019063 0.500 0.433 0.367 0.400 0.367 0.100
b005063 0.567 0.700 0.367 0.167 0.533 0.233
a004035 0.500 0.567 0.633 0.533 0.167 0.233
a003713 0.400 0.333 0.067 0.267 0.433 0.367
b015200 0.167 0.267 0.067 0.200 0.100 0.200
a004755 0.467 0.533 0.600 0.367 0.400 0.333
b019276 0.300 0.500 0.333 0.467 0.133 0.333
b018901 0.700 0.667 0.800 0.700 0.600 0.367
b005570 0.333 0.367 0.367 0.333 0.200 0.200
b006144 0.433 0.533 0.667 0.600 0.433 0.333
b002169 0.167 0.400 0.467 0.333 0.233 0.367
b016133 0.467 0.267 0.433 0.333 0.300 0.167
Ave. Psum Score: 0.425 0.411 0.388 0.374 0.349 0.313
WCsum
queryID EP TP VS LR KWT KWL
a001528 0.289 0.200 0.244 0.333 0.222 0.178
a004667 0.289 0.289 0.356 0.356 0.200 0.244
a000518 0.022 0.111 0.133 0.200 0.089 0.200
a002693 0.622 0.333 0.200 0.400 0.289 0.400
a004830 0.200 0.178 0.156 0.178 0.289 0.200
a002784 0.356 0.378 0.267 0.356 0.222 0.200
a005705 0.578 0.733 0.422 0.267 0.178 0.156
a006272 0.111 0.089 0.067 0.000 0.044 0.067
a007005 0.111 0.044 0.111 0.133 0.111 0.089
a008401 0.200 0.133 0.200 0.044 0.511 0.111
a008850 0.022 0.333 0.133 0.089 0.044 0.044
a007054 0.156 0.222 0.244 0.200 0.156 0.311
a008365 0.111 0.200 0.200 0.178 0.289 0.044
b000990 0.533 0.356 0.311 0.289 0.467 0.244
b001799 0.267 0.400 0.178 0.378 0.289 0.244
b001516 0.311 0.600 0.422 0.267 0.311 0.267
b002576 0.067 0.156 0.133 0.156 0.356 0.111
b004483 0.422 0.311 0.156 0.200 0.444 0.133
b006517 0.556 0.800 0.044 0.711 0.622 0.111
b005395 0.378 0.422 0.356 0.556 0.533 0.311
b007493 0.867 0.711 0.822 0.467 0.511 0.867
b005447 0.378 0.356 0.600 0.133 0.400 0.378
b009401 0.644 0.400 0.644 0.600 0.556 0.489
b006979 0.156 0.156 0.378 0.267 0.111 0.178
b012801 0.200 0.067 0.200 0.311 0.289 0.467
b008611 0.200 0.222 0.200 0.133 0.133 0.089
b013992 0.200 0.178 0.178 0.156 0.133 0.044
b015082 0.489 0.556 0.467 0.467 0.222 0.578
b015991 0.956 0.667 0.867 0.956 0.267 0.267
b009364 0.244 0.200 0.400 0.311 0.356 0.178
a007915 0.667 0.489 0.756 0.244 0.200 0.311
a002856 0.222 0.044 0.067 0.089 0.356 0.244
a000751 0.689 0.422 0.200 0.133 0.044 0.067
a002907 0.311 0.444 0.311 0.067 0.133 0.222
a000193 0.311 0.378 0.222 0.178 0.067 0.178
b006599 0.111 0.089 0.111 0.244 0.067 0.111
b010953 0.711 0.511 0.756 0.867 0.422 0.578
a003397 0.222 0.333 0.156 0.222 0.111 0.067
a006525 0.444 0.378 0.133 0.200 0.378 0.044
b012279 0.533 0.200 0.556 0.200 0.467 0.489
a004526 0.356 0.244 0.289 0.378 0.467 0.378
b010504 0.689 0.533 0.644 0.622 0.289 0.289
b017426 0.333 0.444 0.311 0.378 0.356 0.444
b011185 0.422 0.267 0.333 0.378 0.178 0.311
b011453 0.356 0.178 0.133 0.289 0.311 0.356
b006618 0.422 0.511 0.556 0.444 0.378 0.511
b017223 0.022 0.067 0.111 0.111 0.422 0.422
a001530 0.644 0.511 0.311 0.556 0.489 0.244
b019063 0.422 0.356 0.289 0.311 0.311 0.067
b005063 0.511 0.622 0.267 0.111 0.400 0.178
a004035 0.422 0.489 0.556 0.422 0.111 0.178
a003713 0.356 0.267 0.044 0.200 0.356 0.267
b015200 0.111 0.200 0.044 0.133 0.067 0.133
a004755 0.378 0.444 0.489 0.244 0.311 0.267
b019276 0.222 0.378 0.267 0.378 0.089 0.222
b018901 0.600 0.556 0.733 0.600 0.533 0.244
b005570 0.244 0.289 0.289 0.222 0.133 0.133
b006144 0.333 0.422 0.578 0.533 0.356 0.289
b002169 0.111 0.378 0.378 0.244 0.156 0.311
b016133 0.356 0.178 0.378 0.244 0.222 0.133
Ave. WCsum Score: 0.358 0.340 0.323 0.306 0.280 0.248
Sdsum
queryID EP TP VS LR KWT KWL
a001528 0.233 0.150 0.200 0.283 0.183 0.150
a004667 0.250 0.233 0.300 0.300 0.150 0.200
a000518 0.017 0.083 0.100 0.167 0.067 0.167
a002693 0.583 0.267 0.150 0.350 0.250 0.367
a004830 0.150 0.133 0.117 0.133 0.233 0.150
a002784 0.317 0.333 0.233 0.317 0.183 0.167
a005705 0.550 0.700 0.383 0.217 0.133 0.117
a006272 0.083 0.067 0.050 0.000 0.033 0.050
a007005 0.083 0.033 0.083 0.117 0.083 0.067
a008401 0.167 0.133 0.183 0.033 0.483 0.083
a008850 0.017 0.283 0.100 0.067 0.033 0.033
a007054 0.133 0.200 0.200 0.167 0.117 0.267
a008365 0.083 0.150 0.150 0.150 0.250 0.033
b000990 0.517 0.317 0.267 0.250 0.433 0.200
b001799 0.217 0.350 0.133 0.317 0.250 0.183
b001516 0.283 0.583 0.400 0.250 0.300 0.250
b002576 0.050 0.117 0.100 0.117 0.317 0.100
b004483 0.367 0.250 0.133 0.183 0.400 0.100
b006517 0.517 0.783 0.033 0.683 0.600 0.100
b005395 0.333 0.383 0.317 0.517 0.500 0.267
b007493 0.850 0.700 0.800 0.433 0.483 0.850
b005447 0.350 0.333 0.567 0.117 0.367 0.350
b009401 0.600 0.333 0.600 0.550 0.500 0.450
b006979 0.133 0.133 0.350 0.250 0.100 0.167
b012801 0.167 0.050 0.167 0.267 0.217 0.417
b008611 0.167 0.183 0.183 0.100 0.100 0.067
b013992 0.150 0.133 0.133 0.117 0.100 0.033
b015082 0.467 0.517 0.417 0.417 0.200 0.550
b015991 0.950 0.633 0.850 0.950 0.250 0.233
b009364 0.200 0.150 0.350 0.283 0.317 0.133
a007915 0.633 0.450 0.733 0.217 0.167 0.283
a002856 0.183 0.033 0.067 0.067 0.300 0.200
a000751 0.667 0.417 0.167 0.100 0.033 0.050
a002907 0.283 0.417 0.283 0.050 0.100 0.183
a000193 0.283 0.350 0.200 0.150 0.050 0.150
b006599 0.083 0.067 0.083 0.217 0.050 0.083
b010953 0.683 0.467 0.733 0.850 0.383 0.550
a003397 0.183 0.300 0.133 0.200 0.100 0.050
a006525 0.433 0.367 0.117 0.200 0.367 0.033
b012279 0.500 0.183 0.533 0.183 0.433 0.450
a004526 0.317 0.200 0.267 0.350 0.417 0.317
b010504 0.650 0.483 0.617 0.600 0.267 0.250
b017426 0.300 0.400 0.250 0.333 0.300 0.400
b011185 0.367 0.233 0.267 0.317 0.150 0.250
b011453 0.317 0.150 0.100 0.250 0.250 0.317
b006618 0.367 0.483 0.517 0.400 0.317 0.467
b017223 0.017 0.050 0.100 0.083 0.383 0.383
a001530 0.600 0.450 0.283 0.500 0.433 0.200
b019063 0.383 0.317 0.250 0.267 0.283 0.050
b005063 0.483 0.583 0.217 0.083 0.333 0.150
a004035 0.383 0.450 0.517 0.367 0.083 0.150
a003713 0.333 0.233 0.033 0.167 0.317 0.217
b015200 0.083 0.167 0.033 0.100 0.050 0.100
a004755 0.333 0.400 0.433 0.183 0.267 0.233
b019276 0.183 0.317 0.233 0.333 0.067 0.167
b018901 0.550 0.500 0.700 0.550 0.500 0.183
b005570 0.200 0.250 0.250 0.167 0.100 0.100
b006144 0.283 0.367 0.533 0.500 0.317 0.267
b002169 0.083 0.367 0.333 0.200 0.117 0.283
b016133 0.300 0.133 0.350 0.200 0.183 0.117
Ave. SDsum Score: 0.324 0.305 0.290 0.271 0.246 0.216
Greater0
queryID EP TP VS LR KWT KWL
a001528 0.733 0.600 0.600 0.733 0.533 0.400
a004667 0.600 0.733 0.800 0.800 0.600 0.600
a000518 0.067 0.333 0.400 0.467 0.267 0.467
a002693 0.933 0.867 0.600 0.800 0.600 0.667
a004830 0.600 0.533 0.467 0.533 0.733 0.600
a002784 0.667 0.733 0.533 0.667 0.533 0.467
a005705 0.800 1.000 0.733 0.667 0.533 0.467
a006272 0.333 0.267 0.200 0.000 0.133 0.200
a007005 0.333 0.133 0.333 0.267 0.333 0.267
a008401 0.467 0.133 0.333 0.133 0.733 0.333
a008850 0.067 0.733 0.400 0.267 0.133 0.133
a007054 0.333 0.400 0.600 0.467 0.467 0.667
a008365 0.333 0.600 0.600 0.400 0.600 0.133
b000990 0.667 0.667 0.667 0.600 0.733 0.600
b001799 0.667 0.800 0.533 0.867 0.600 0.733
b001516 0.533 0.733 0.600 0.400 0.400 0.400
b002576 0.200 0.467 0.400 0.467 0.667 0.200
b004483 0.867 0.800 0.333 0.333 0.800 0.400
b006517 0.867 0.933 0.133 0.933 0.800 0.200
b005395 0.733 0.733 0.667 0.867 0.800 0.667
b007493 1.000 0.800 1.000 0.733 0.733 1.000
b005447 0.600 0.533 0.867 0.267 0.667 0.600
b009401 1.000 0.933 1.000 1.000 1.000 0.800
b006979 0.333 0.333 0.600 0.400 0.200 0.267
b012801 0.467 0.200 0.467 0.667 0.867 0.867
b008611 0.467 0.533 0.333 0.400 0.400 0.267
b013992 0.600 0.533 0.533 0.467 0.400 0.133
b015082 0.667 0.867 0.867 0.867 0.400 0.800
b015991 1.000 0.933 1.000 1.000 0.400 0.533
b009364 0.600 0.600 0.800 0.533 0.667 0.533
a007915 0.933 0.800 0.933 0.467 0.467 0.533
a002856 0.533 0.133 0.067 0.267 0.800 0.600
a000751 0.867 0.467 0.467 0.400 0.133 0.200
a002907 0.533 0.667 0.533 0.200 0.400 0.533
a000193 0.533 0.600 0.400 0.400 0.200 0.400
b006599 0.333 0.267 0.333 0.467 0.200 0.333
b010953 0.933 0.867 0.933 1.000 0.733 0.800
a003397 0.533 0.600 0.333 0.400 0.200 0.200
a006525 0.533 0.467 0.267 0.200 0.467 0.133
b012279 0.800 0.333 0.733 0.333 0.733 0.800
a004526 0.667 0.600 0.467 0.600 0.867 0.867
b010504 1.000 0.933 0.867 0.800 0.467 0.600
b017426 0.600 0.800 0.800 0.733 0.800 0.800
b011185 0.867 0.533 0.867 0.867 0.400 0.800
b011453 0.667 0.400 0.400 0.600 0.800 0.667
b006618 0.867 0.733 0.867 0.800 0.867 0.867
b017223 0.067 0.200 0.200 0.333 0.733 0.733
a001530 1.000 1.000 0.533 1.000 0.933 0.600
b019063 0.733 0.667 0.600 0.667 0.533 0.200
b005063 0.733 0.933 0.667 0.333 0.933 0.400
a004035 0.733 0.800 0.867 0.867 0.333 0.400
a003713 0.533 0.533 0.133 0.467 0.667 0.667
b015200 0.333 0.467 0.133 0.400 0.200 0.400
a004755 0.733 0.800 0.933 0.733 0.667 0.533
b019276 0.533 0.867 0.533 0.733 0.267 0.667
b018901 1.000 1.000 1.000 1.000 0.800 0.733
b005570 0.600 0.600 0.600 0.667 0.400 0.400
b006144 0.733 0.867 0.933 0.800 0.667 0.467
b002169 0.333 0.467 0.733 0.600 0.467 0.533
b016133 0.800 0.533 0.600 0.600 0.533 0.267
Ave. greater0 Score: 0.627 0.623 0.586 0.579 0.557 0.509
Greater1
queryID EP TP VS LR KWT KWL
a001528 0.067 0.000 0.067 0.133 0.067 0.067
a004667 0.133 0.067 0.133 0.133 0.000 0.067
a000518 0.000 0.000 0.000 0.067 0.000 0.067
a002693 0.467 0.067 0.000 0.200 0.133 0.267
a004830 0.000 0.000 0.000 0.000 0.067 0.000
a002784 0.200 0.200 0.133 0.200 0.067 0.067
a005705 0.467 0.600 0.267 0.067 0.000 0.000
a006272 0.000 0.000 0.000 0.000 0.000 0.000
a007005 0.000 0.000 0.000 0.067 0.000 0.000
a008401 0.067 0.133 0.133 0.000 0.400 0.000
a008850 0.000 0.133 0.000 0.000 0.000 0.000
a007054 0.067 0.133 0.067 0.067 0.000 0.133
a008365 0.000 0.000 0.000 0.067 0.133 0.000
b000990 0.467 0.200 0.133 0.133 0.333 0.067
b001799 0.067 0.200 0.000 0.133 0.133 0.000
b001516 0.200 0.533 0.333 0.200 0.267 0.200
b002576 0.000 0.000 0.000 0.000 0.200 0.067
b004483 0.200 0.067 0.067 0.133 0.267 0.000
b006517 0.400 0.733 0.000 0.600 0.533 0.067
b005395 0.200 0.267 0.200 0.400 0.400 0.133
b007493 0.800 0.667 0.733 0.333 0.400 0.800
b005447 0.267 0.267 0.467 0.067 0.267 0.267
b009401 0.467 0.133 0.467 0.400 0.333 0.333
b006979 0.067 0.067 0.267 0.200 0.067 0.133
b012801 0.067 0.000 0.067 0.133 0.000 0.267
b008611 0.067 0.067 0.133 0.000 0.000 0.000
b013992 0.000 0.000 0.000 0.000 0.000 0.000
b015082 0.400 0.400 0.267 0.267 0.133 0.467
b015991 0.933 0.533 0.800 0.933 0.200 0.133
b009364 0.067 0.000 0.200 0.200 0.200 0.000
a007915 0.533 0.333 0.667 0.133 0.067 0.200
a002856 0.067 0.000 0.067 0.000 0.133 0.067
a000751 0.600 0.400 0.067 0.000 0.000 0.000
a002907 0.200 0.333 0.200 0.000 0.000 0.067
a000193 0.200 0.267 0.133 0.067 0.000 0.067
b006599 0.000 0.000 0.000 0.133 0.000 0.000
b010953 0.600 0.333 0.667 0.800 0.267 0.467
a003397 0.067 0.200 0.067 0.133 0.067 0.000
a006525 0.400 0.333 0.067 0.200 0.333 0.000
b012279 0.400 0.133 0.467 0.133 0.333 0.333
a004526 0.200 0.067 0.200 0.267 0.267 0.133
b010504 0.533 0.333 0.533 0.533 0.200 0.133
b017426 0.200 0.267 0.067 0.200 0.133 0.267
b011185 0.200 0.133 0.067 0.133 0.067 0.067
b011453 0.200 0.067 0.000 0.133 0.067 0.200
b006618 0.200 0.400 0.400 0.267 0.133 0.333
b017223 0.000 0.000 0.067 0.000 0.267 0.267
a001530 0.467 0.267 0.200 0.333 0.267 0.067
b019063 0.267 0.200 0.133 0.133 0.200 0.000
b005063 0.400 0.467 0.067 0.000 0.133 0.067
a004035 0.267 0.333 0.400 0.200 0.000 0.067
a003713 0.267 0.133 0.000 0.067 0.200 0.067
b015200 0.000 0.067 0.000 0.000 0.000 0.000
a004755 0.200 0.267 0.267 0.000 0.133 0.133
b019276 0.067 0.133 0.133 0.200 0.000 0.000
b018901 0.400 0.333 0.600 0.400 0.400 0.000
b005570 0.067 0.133 0.133 0.000 0.000 0.000
b006144 0.133 0.200 0.400 0.400 0.200 0.200
b002169 0.000 0.333 0.200 0.067 0.000 0.200
b016133 0.133 0.000 0.267 0.067 0.067 0.067
Ave. greater1 Score: 0.223 0.199 0.191 0.169 0.142 0.118


Raw Scores

The raw data derived from the Evalutron 6000 human evaluations are located on the 2006:Audio Music Similarity and Retrieval Raw Data page.

Query Meta Data

queryID artist genre
a001528 Xpression Jazz
a004667 The Tony Rich Project R&B
a000518 Junior C. Reggae
a002693 B.J. Thomas Country
a004830 Luciano & Co Reggae
a002784 Elton John Rock
a005705 Jessica R&B
a006272 Orlando Barroso Latin
a007005 Big Time Operator Jazz
a008401 Prince Malachi Reggae
a008850 Elida y Avante Latin
a007054 Profyle R&B
a008365 Barbara Sfraga Jazz
b000990 Guns N' Roses Rock
b001799 Enya New Age
b001516 Britney Spears Rock
b002576 Depeche Mode Rock
b004483 Elvis Costello Rock
b006517 Paul Van Dyk Electronica & Dance
b005395 Ozzy Osbourne Rock
b007493 Eminem Rap & Hip Hop
b005447 Mudvayne Rock
b009401 Ja Rule Rap & Hip Hop
b006979 Cat Stevens Rock
b012801 The Chemical Brothers Electronica & Dance
b008611 The Cranberries Rock
b013992 Enigma New Age
b015082 DMX Rap & Hip Hop
b015991 Tim McGraw Country
b009364 Bon Jovi Rock
a007915 Victor Sanz Country
a002856 Atomic Babies Electronica & Dance
a000751 Brian Hughes Jazz
a002907 Gary Meek Jazz
a000193 Mercurio Latin
b006599 Selena Latin
b010953 Jessica Andrews Country
a003397 Roy Davis Jr. Electronica & Dance
a006525 Wind Machine New Age
b012279 OutKast Rap & Hip Hop
a004526 Shannon R&B
b010504 LL Cool J Rap & Hip Hop
b017426 Shaggy Reggae
b011185 Sting Rock
b011453 Neil Young Rock
b006618 Foo Fighters Rock
b017223 Nirvana Rock
a001530 Mötley Crüe Rock
b019063 Smashing Pumpkins Rock
b005063 Sublime Rock
a004035 Toy-Box Electronica & Dance
a003713 Brian Bromberg Jazz
b015200 Mike Oldfield New Age
a004755 Profyle R&B
b019276 Robbie Williams Rock
b018901 Nelly Rap & Hip Hop
b005570 Everything But the Girl Rock
b006144 Def Leppard Rock
b002169 No Doubt Rock
b016133 Janet Jackson Rock



Results from Automatic Evaluation

Statistic                    Pohle    Pampalk  Lidy & Rauber  West (Trans)  West (Likely)
top20genre%                  60.84%   60.64%   56.96%         53.18%        47.76%
top20artist%                 41.32%   34.73%   27.73%         20.68%        15.85%
top20album%                  36.57%   30.54%   32.16%         24.72%        19.42%
artist-filtered genre%       58.91%   60.70%   56.71%         54.06%        49.20%
mean artist-filtered genre%  27.27%   28.27%   26.01%         21.62%        19.56%
avg dist - genre             0.6970   0.9924   0.9524         0.9738        0.9830
avg dist - artist            0.4244   0.9772   0.7339         0.8734        0.6010
avg dist - album             0.3721   0.9758   0.7205         0.8689        0.5702
triangular inequality        32.02%   100.00%  100.00%        100.00%       55.08%
top20always-sim              260      1928     137            173           90
top20never-sim%              0.0%     0.0%     0.0%           0.0%          0.0%



Other Results from Automatic Evaluation

See 2006:Audio Music Similarity and Retrieval Other Automatic Evaluation Results page.


Introduction to automatic evaluation

Automated evaluation of music similarity techniques based on a metadata catalogue has several advantages:

  • It does not require costly human 'graders'
  • It allows testing of incremental changes in indexing algorithms
  • It can achieve complete coverage over the test collection
  • It provides a target for machine-learning, feature-selection and optimisation experiments
  • It can predict the visualisation performance of an indexing technique
  • It can identify indexing 'anomalies' in the indices tested

Automated 'pseudo-objective' evaluation of music similarity estimation techniques was introduced by Logan & Salomon [1] and was shown to be highly correlated with careful human-based evaluations by Pampalk [2]. The results of this contest support the conclusions of Pampalk [2], although further work is required to fully understand the evaluation statistics.


Description of evaluation statistics

The evaluation statistics computed were as follows:

Neighbourhood clustering (artist, genre, album)
average % of the top N results for each query in the collection with the same label
Artist-filtered genre neighbourhood
average % of the top N results for each query belonging to the same genre label, ignoring matches from the same artist (ensures that results reflect musical not audio similarity)
Mean Artist-filtered genre neighbourhood
normalised form of the above statistic equally weighting each genre, penalising lop-sided performance.
Normalised average distance between examples
average distance between examples with the same label, indicates degree of clustering and potential for visual organisation of a collection
Always similar (hubs)
largest number of times an example appears in the top N results for other queries; a result that appears too often will adversely affect performance without affecting the other statistics
Never similar (orphans)
% of examples that never appear in a top N result list and cannot be retrieved by search
Triangular inequality (metric space)
indicates whether the function produces a metric distance space and therefore what visualisation techniques may be applied to it
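
To make the definitions above concrete, a minimal Python sketch of the artist-filtered genre neighbourhood and the hub/orphan statistics is given below. It assumes a full distance matrix and per-song genre and artist labels; it is not the M2K implementation used for the official figures.

 import numpy as np

 def artist_filtered_genre_pct(dist, genres, artists, n=20):
     # mean % of the top-n results sharing the query's genre, ignoring
     # results by the query's own artist (cf. 'artist-filtered genre')
     hits = []
     for q in range(len(dist)):
         order = np.argsort(dist[q])
         top = [i for i in order if i != q and artists[i] != artists[q]][:n]
         hits.append(np.mean([genres[i] == genres[q] for i in top]))
     return 100.0 * np.mean(hits)

 def hubs_and_orphans(dist, n=20):
     # hub count: max number of appearances of any one example in other
     # queries' top-n lists; orphan %: examples that never appear
     counts = np.zeros(len(dist), dtype=int)
     for q in range(len(dist)):
         top = [i for i in np.argsort(dist[q]) if i != q][:n]
         counts[top] += 1
     return counts.max(), 100.0 * np.mean(counts == 0)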


Normalisation

Each of the neighbourhood statistics described above has been normalised by the number of examples of each class (a genre, album or artist) that is available in the test database. For example, if the collection contained 20 tracks by a particular artist and a particular system retrieved 10 of those examples in its top 50 results, it would normally achieve an artist neighbourhood score of 20%, while the normalised form of the metric would report a score of 50% (half of the available matches were retrieved). Such normalisation is intended to avoid bias introduced into the results by the skewed distribution of examples across each label set.

The mean artist-filtered genre neighbourhood is a normalised form of the artist-filtered genre neighbourhood metric, which gives equal weight to the performance of a system on each genre class. This version of the statistic is intended to match the prior probabilities or distribution of examples according to the genre labels used as queries in the human listening test (where an equal number of examples from each class was selected, i.e. stratified random sampling) instead of the prior probabilities or distribution of examples appearing in the database.
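
A sketch of this class-weighted form follows: it builds a genre 'confusion' matrix from the artist-filtered top-N lists, row-normalises it, and takes the mean of the diagonal so that each genre contributes equally. The data structures are hypothetical and every genre is assumed to appear at least once as a query; this is an illustration rather than the official computation.

 import numpy as np

 def mean_artist_filtered_genre(top_lists, genres):
     # top_lists: artist-filtered top-N result indices for each query (query q = song q)
     # genres: genre label per song
     labels = sorted(set(genres))
     idx = {g: i for i, g in enumerate(labels)}
     conf = np.zeros((len(labels), len(labels)))
     for q, top in enumerate(top_lists):
         for r in top:
             conf[idx[genres[q]], idx[genres[r]]] += 1
     conf /= conf.sum(axis=1, keepdims=True)   # row-normalise per query genre
     return 100.0 * float(np.mean(np.diag(conf)))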


Music-similarity evaluation issues

Care must be taken with all evaluations of audio music similarity estimation techniques, as there is great potential for over-fitting in these experiments and for producing over-optimistic estimates of a system's performance on novel test data.

The metadata catalog used to conduct automated evaluations should be as accurate as possible. However, this technique seems relatively robust to a degree of noise in the catalogue, perhaps due to its coarse granularity.

Small test collections do not allow us to accurately predict performance on larger test collections, for example:

  • Indexing anomalies ('hubs' and 'orphans') cannot yet be understood.
    • A single 'hub' was found in the results of one system:
      • it appeared in nearly 2/5 of result lists;
      • removing this one example from the collection of 5000 tracks makes it appear that the system does not suffer from indexing anomalies.
    • What will be the number and coverage of 'hubs' in a 100,000 song DB?

Kriswest


Directions for further work on evaluating audio music similarity

  • Establish whether the stratified sampling used in the human evaluations is optimal for producing results that reflect human perception of the quality of music indexes, or whether the database should be sampled randomly.
    • This will influence the selection of a statistic for use in automated evaluations or optimisation experiments (artist-filtered genre or mean artist-filtered genre).
  • Explain the indexing anomalies in some techniques.
  • Determine a safe minimum size for a test collection to be used to predict performance on an 'industrial-sized' collection.
  • Establish the optimal granularity or range of granularities for a genre catalogue to be used in this type of evaluation (8, 32 or 256 classes?) and integrate a confusion-cost matrix to reduce the penalisation of confusion between similar genres of music (e.g. Punk and Heavy Metal) relative to confusion between highly dissimilar genres (e.g. Classical and Heavy Metal); a possible application is sketched after this list.

Kriswest
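
As a purely illustrative sketch of the confusion-cost idea raised in the last bullet above (no such cost matrix was used in the 2006 evaluation), the binary same-genre test could be replaced by a graded cost lookup; the matrix entries here are invented placeholders.

 # Hypothetical cost matrix: 0 = same genre, small cost for confusions between
 # related genres, full cost for confusions between highly dissimilar genres.
 cost = {
     ('Punk', 'Heavy Metal'): 0.3,
     ('Classical', 'Heavy Metal'): 1.0,
 }

 def genre_cost(query_genre, result_genre):
     if query_genre == result_genre:
         return 0.0
     return cost.get((query_genre, result_genre),
                     cost.get((result_genre, query_genre), 1.0))

 def cost_weighted_score(query_genre, result_genres):
     # 1.0 when every result matches the query genre, lower as confusions get costlier
     return 1.0 - sum(genre_cost(query_genre, g) for g in result_genres) / len(result_genres)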


Evaluation Tools in Music-2-Knowledge (M2K)

The tools used to produce the evaluation statistics for MIREX 2006 will be released as part of M2K 1.2 (forthcoming). These tools provide services to:

  • import collection metadata and distance matrices
  • generate a stratified query set
  • extract artist-filtered results (for use in human evaluation experiments)
  • calculate any of the evaluation statistics described above.

These tools may be used on the command line (by implementing the MIREX distance matrix file format), within M2K in the Data-2-Knowledge toolkit (D2K), or integrated into existing Java code via the new M2K API.

To obtain a copy of the evaluation tools prior to the M2K 1.2 release, contact Kris West (kw@cmp.uea.ac.uk).


Comments

The evaluation statistics for the MIREX 2006 Audio Music Similarity contest seem to support the contention that the genre, artist and artist-filtered genre neighbourhood statistics are correlated with human perception of the performance of music similarity estimators, as they all reproduce the ranking produced by the human evaluation. However, the differences between systems in that evaluation are not statistically significant, so no firm conclusion can be drawn. Average distance statistics produce a different ranking, but they are intended to correlate with visualisation performance rather than search. Kriswest


A statistic for evaluation and use in selection & optimization experiments

As each statistic was found to be correlated with the results of the listening test, any *may* be used to evaluate performance and to guide model optimisation or feature selection/weighting experiments. However, unfiltered genre and artist identification statistics are known to allow overfitting, producing over-optimistic performance estimates. In a model optimisation or feature selection experiment these statistics are more likely to indicate Audio-similarity performance rather than actual Music-similarity performance, and may lead to the selection of sub-optimal features or models. The artist-filtered genre neighbourhood can be used to avoid this effect.

The results from MIREX 2006 do not show a significant drop in performance using the artist-filtered genre statistic, as would normally be expected. This may be due to the excessively skewed distribution of examples in the database (roughly 50% of examples are labelled as Rock/Pop, while a further 25% are Rap & Hip-Hop). Hence, the difference between the results produced and the random baseline is not well emphasized. Normalising this statistic by the prior probabilities of examples in the database (taking the mean of the diagonal of the artist-filtered genre confusion matrix) equally weights the contribution of each class to the final statistic and prevents performance on a single class from dominating the statistic. This normalised statistic shows a drastic reduction in the performance estimates for each system and increases the relative distance between each of the systems in the evaluation. Kriswest

References

  1. Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis. http://gatekeeper.research.compaq.com/pub/compaq/CRL/publications/logan/icme2001_logan.pdf
     One of the first papers on this topic. Reports a small-scale listening test (2 users) in which users rate items in a playlist as similar or not similar to the query song. In addition, automatic evaluation is reported: the percentage of the top 5, 10 and 20 most similar songs in the same genre/artist/album as the query.
  2. E. Pampalk, Computational Models of Music Similarity and their Application in Music Information Retrieval. PhD thesis, Vienna University of Technology, Austria, March 2006. http://www.ofai.at/~elias.pampalk/publications/pampalk06thesis.pdf