2012:Audio Music Similarity and Retrieval Results

From MIREX Wiki

Introduction

These are the results for the 2012 running of the Audio Music Similarity and Retrieval task set. For background information about this task set please refer to the Audio Music Similarity and Retrieval page.

Each system was given 7000 songs chosen from IMIRSEL's "uspop", "uscrap", "american", "classical" and "sundry" collections. Each system then returned a 7000x7000 distance matrix. 50 songs were randomly selected from the 10 genre groups (5 per genre) as queries, and the 5 most highly ranked songs out of the 7000 were extracted for each query (after filtering out the query itself; returned results from the same artist were also omitted). Then, for each query, the returned results (candidates) from all participants were grouped and evaluated by human graders using the Evalutron 6000 grading system. Each individual query/candidate set was evaluated by a single grader. For each query/candidate pair, graders provided two scores: one categorical BROAD score with 3 categories (NS, SS, VS, as explained below) and one FINE score (in the range from 0 to 100). A description and analysis is provided below.

The systems read in 30-second audio clips as their raw data. The same 30-second clips were used in the grading stage.
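The candidate-extraction step described above can be sketched as follows. This is a minimal illustration rather than the code used for the evaluation: the function name `top_candidates`, the artist labels and the toy distances are all hypothetical, and it assumes that a smaller distance means a more similar track.

```python
from typing import List, Sequence


def top_candidates(distances: Sequence[float], query: int,
                   artists: Sequence[str], k: int = 5) -> List[int]:
    """Return the indices of the k tracks nearest to `query`.

    `distances` is the query's row of the distance matrix. The query itself
    and every track by the same artist are filtered out before ranking.
    """
    eligible = [
        i for i in range(len(distances))
        if i != query and artists[i] != artists[query]
    ]
    # Assumption: smaller distance = more similar. The sort is stable, so
    # tied tracks keep their original (index) order.
    eligible.sort(key=lambda i: distances[i])
    return eligible[:k]


# Toy example with 6 tracks instead of 7000 (hypothetical data).
row = [0.0, 0.1, 0.4, 0.2, 0.9, 0.3]
artists = ["A", "A", "B", "C", "D", "E"]
print(top_candidates(row, query=0, artists=artists, k=3))
```

In the toy data, track 1 is the closest to the query but shares its artist, so it is skipped in the same way as the query itself.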


General Legend

Team ID

Sub code Submission name Abstract Contributors
DM6 DM6 PDF Franz de Leon, Kirk Martinez
DM7 DM7 PDF Franz de Leon, Kirk Martinez
GT3 MarsyasSimilarity PDF George Tzanetakis
JR2 modulationSim PDF Jia-Min Ren, Jyh-Shing Roger Jang
NHHL1 AMSR_2012_1 PDF Byeong-jun Han, Kyogu Lee, Juhan Nam, Jorge Herrera
NHHL2 AMSR_2012_2 PDF Byeong-jun Han, Kyogu Lee, Juhan Nam, Jorge Herrera
PS1 PS09 PDF Dominik Schnitzer, Tim Pohle
RW4 modulationSimFrameUBM PDF Jia-Min Ren, Ming-Ju Wu, Jyh-Shing Roger Jang
SSKP1 cbmr_sim_2010 PDF Klaus Seyerlehner, Markus Schedl, Peter Knees, Tim Pohle
SSKS2 cbmr_sim_2011 PDF Klaus Seyerlehner, Markus Schedl, Peter Knees, Reinhard Sonnleitner

Broad Categories

NS = Not Similar
SS = Somewhat Similar
VS = Very Similar

Understanding Summary Measures

Fine = Ranges from 0 (failure) to 100 (perfection).
Broad = Ranges from 0 (failure) to 2 (perfection), as each query/candidate pair is scored as NS=0, SS=1 or VS=2.
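
As a small illustration of the two summary measures, the sketch below maps BROAD labels to points and averages them, and averages FINE scores directly. The names `BROAD_POINTS`, `mean_broad` and `mean_fine` and the sample grades are hypothetical and are not taken from the evaluation data.

```python
from typing import Sequence

# Point values for the three BROAD categories, as defined above.
BROAD_POINTS = {"NS": 0, "SS": 1, "VS": 2}


def mean_broad(labels: Sequence[str]) -> float:
    """Average BROAD score (0 to 2) over a list of NS/SS/VS labels."""
    return sum(BROAD_POINTS[label] for label in labels) / len(labels)


def mean_fine(scores: Sequence[float]) -> float:
    """Average FINE score (0 to 100) over a list of grader scores."""
    return sum(scores) / len(scores)


# Hypothetical grades for the 5 candidates one system returned for one query.
labels = ["VS", "SS", "SS", "NS", "VS"]
fine = [90.0, 55.0, 60.0, 10.0, 85.0]
print(mean_broad(labels), mean_fine(fine))
```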

Human Evaluation

Overall Summary Results

Measure DM6 DM7 GT3 JR2 NHHL1 NHHL2 PS1 RW4 SSKP1 SSKS2
Average Fine Score 36.176 36.332 44.872 47.020 45.944 45.944 53.136 50.000 52.640 53.188
Average Cat (BROAD) Score 0.680 0.682 0.894 0.956 0.926 0.926 1.128 1.048 1.138 1.132

Friedman's Tests

Friedman's Test (FINE Scores)

The Friedman test was run in MATLAB against the FINE summary data over the 50 queries.
Command: [c,m,h,gnames] = multcompare(stats, 'ctype', 'tukey-kramer','estimate', 'friedman', 'alpha', 0.05);

TeamID TeamID Lowerbound Mean Upperbound Significance
SSKS2 PS1 -2.014 -0.110 1.794 FALSE
SSKS2 SSKP1 -1.684 0.220 2.124 FALSE
SSKS2 RW4 -1.284 0.620 2.524 FALSE
SSKS2 JR2 -0.164 1.740 3.644 FALSE
SSKS2 NHHL2 0.596 2.500 4.404 TRUE
SSKS2 NHHL1 0.596 2.500 4.404 TRUE
SSKS2 GT3 0.616 2.520 4.424 TRUE
SSKS2 DM7 2.726 4.630 6.534 TRUE
SSKS2 DM6 2.776 4.680 6.584 TRUE
PS1 SSKP1 -1.574 0.330 2.234 FALSE
PS1 RW4 -1.174 0.730 2.634 FALSE
PS1 JR2 -0.054 1.850 3.754 FALSE
PS1 NHHL2 0.706 2.610 4.514 TRUE
PS1 NHHL1 0.706 2.610 4.514 TRUE
PS1 GT3 0.726 2.630 4.534 TRUE
PS1 DM7 2.836 4.740 6.644 TRUE
PS1 DM6 2.886 4.790 6.694 TRUE
SSKP1 RW4 -1.504 0.400 2.304 FALSE
SSKP1 JR2 -0.384 1.520 3.424 FALSE
SSKP1 NHHL2 0.376 2.280 4.184 TRUE
SSKP1 NHHL1 0.376 2.280 4.184 TRUE
SSKP1 GT3 0.396 2.300 4.204 TRUE
SSKP1 DM7 2.506 4.410 6.314 TRUE
SSKP1 DM6 2.556 4.460 6.364 TRUE
RW4 JR2 -0.784 1.120 3.024 FALSE
RW4 NHHL2 -0.024 1.880 3.784 FALSE
RW4 NHHL1 -0.024 1.880 3.784 FALSE
RW4 GT3 -0.004 1.900 3.804 FALSE
RW4 DM7 2.106 4.010 5.914 TRUE
RW4 DM6 2.156 4.060 5.964 TRUE
JR2 NHHL2 -1.144 0.760 2.664 FALSE
JR2 NHHL1 -1.144 0.760 2.664 FALSE
JR2 GT3 -1.124 0.780 2.684 FALSE
JR2 DM7 0.986 2.890 4.794 TRUE
JR2 DM6 1.036 2.940 4.844 TRUE
NHHL2 NHHL1 -1.904 0.000 1.904 FALSE
NHHL2 GT3 -1.884 0.020 1.924 FALSE
NHHL2 DM7 0.226 2.130 4.034 TRUE
NHHL2 DM6 0.276 2.180 4.084 TRUE
NHHL1 GT3 -1.884 0.020 1.924 FALSE
NHHL1 DM7 0.226 2.130 4.034 TRUE
NHHL1 DM6 0.276 2.180 4.084 TRUE
GT3 DM7 0.206 2.110 4.014 TRUE
GT3 DM6 0.256 2.160 4.064 TRUE
DM7 DM6 -1.854 0.050 1.954 FALSE

Figure: Evalutron.fine.friedman.tukeyKramerHSD.png (Tukey-Kramer HSD comparison for the Friedman test on FINE scores)
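
The Friedman procedure ranks the systems within each query and then compares their mean ranks. The original analysis was run in MATLAB; the Python sketch below only illustrates the within-query ranking step, with tied scores sharing the average rank. The function names are hypothetical, and the example uses just three systems and three BAROQUE queries from the per-query FINE table, so it does not reproduce the published values.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values ascending (1 = lowest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def mean_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean within-query rank per system; scores[system][q] is query q's score."""
    systems = list(scores)
    n_queries = len(scores[systems[0]])
    totals = {s: 0.0 for s in systems}
    for q in range(n_queries):
        row_ranks = average_ranks([scores[s][q] for s in systems])
        for s, r in zip(systems, row_ranks):
            totals[s] += r
    return {s: totals[s] / n_queries for s in systems}


# FINE scores for queries d005709, d006218 and d010595 (subset for illustration).
subset = {
    "DM6": [44.7, 9.9, 69.0],
    "GT3": [77.5, 27.0, 64.0],
    "SSKS2": [50.7, 34.6, 76.5],
}
print(mean_ranks(subset))
```

With this ranking direction a higher score receives a higher rank, which is consistent with the positive mean differences reported when a stronger system is listed first.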

Friedman's Test (BROAD Scores)

The Friedman test was run in MATLAB against the BROAD summary data over the 50 queries.
Command: [c,m,h,gnames] = multcompare(stats, 'ctype', 'tukey-kramer','estimate', 'friedman', 'alpha', 0.05);

TeamID TeamID Lowerbound Mean Upperbound Significance
SSKP1 SSKS2 -2.052 -0.210 1.632 FALSE
SSKP1 PS1 -1.682 0.160 2.002 FALSE
SSKP1 RW4 -1.022 0.820 2.662 FALSE
SSKP1 JR2 -0.262 1.580 3.422 FALSE
SSKP1 NHHL2 0.488 2.330 4.172 TRUE
SSKP1 NHHL1 0.488 2.330 4.172 TRUE
SSKP1 GT3 0.388 2.230 4.072 TRUE
SSKP1 DM7 2.538 4.380 6.222 TRUE
SSKP1 DM6 2.538 4.380 6.222 TRUE
SSKS2 PS1 -1.472 0.370 2.212 FALSE
SSKS2 RW4 -0.812 1.030 2.872 FALSE
SSKS2 JR2 -0.052 1.790 3.632 FALSE
SSKS2 NHHL2 0.698 2.540 4.382 TRUE
SSKS2 NHHL1 0.698 2.540 4.382 TRUE
SSKS2 GT3 0.598 2.440 4.282 TRUE
SSKS2 DM7 2.748 4.590 6.432 TRUE
SSKS2 DM6 2.748 4.590 6.432 TRUE
PS1 RW4 -1.182 0.660 2.502 FALSE
PS1 JR2 -0.422 1.420 3.262 FALSE
PS1 NHHL2 0.328 2.170 4.012 TRUE
PS1 NHHL1 0.328 2.170 4.012 TRUE
PS1 GT3 0.228 2.070 3.912 TRUE
PS1 DM7 2.378 4.220 6.062 TRUE
PS1 DM6 2.378 4.220 6.062 TRUE
RW4 JR2 -1.082 0.760 2.602 FALSE
RW4 NHHL2 -0.332 1.510 3.352 FALSE
RW4 NHHL1 -0.332 1.510 3.352 FALSE
RW4 GT3 -0.432 1.410 3.252 FALSE
RW4 DM7 1.718 3.560 5.402 TRUE
RW4 DM6 1.718 3.560 5.402 TRUE
JR2 NHHL2 -1.092 0.750 2.592 FALSE
JR2 NHHL1 -1.092 0.750 2.592 FALSE
JR2 GT3 -1.192 0.650 2.492 FALSE
JR2 DM7 0.958 2.800 4.642 TRUE
JR2 DM6 0.958 2.800 4.642 TRUE
NHHL2 NHHL1 -1.842 0.000 1.842 FALSE
NHHL2 GT3 -1.942 -0.100 1.742 FALSE
NHHL2 DM7 0.208 2.050 3.892 TRUE
NHHL2 DM6 0.208 2.050 3.892 TRUE
NHHL1 GT3 -1.942 -0.100 1.742 FALSE
NHHL1 DM7 0.208 2.050 3.892 TRUE
NHHL1 DM6 0.208 2.050 3.892 TRUE
GT3 DM7 0.308 2.150 3.992 TRUE
GT3 DM6 0.308 2.150 3.992 TRUE
DM7 DM6 -1.842 0.000 1.842 FALSE

Figure: Evalutron.cat.friedman.tukeyKramerHSD.png (Tukey-Kramer HSD comparison for the Friedman test on BROAD scores)
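
When reading the comparison tables, it helps to note that every row marked TRUE has a confidence interval that excludes zero. The sketch below checks this on three rows copied from the BROAD table. It assumes the Significance column follows the usual interpretation of multiple-comparison intervals, and the helper name `is_significant` is hypothetical.

```python
def is_significant(lower: float, upper: float) -> bool:
    """True when the confidence interval [lower, upper] excludes zero."""
    return lower > 0 or upper < 0


# (TeamID, TeamID, Lowerbound, Mean, Upperbound) copied from the BROAD table.
rows = [
    ("SSKP1", "JR2", -0.262, 1.580, 3.422),
    ("SSKP1", "NHHL2", 0.488, 2.330, 4.172),
    ("DM7", "DM6", -1.842, 0.000, 1.842),
]
for first, second, lower, mean, upper in rows:
    print(first, second, is_significant(lower, upper))
```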

Summary Results by Query

FINE Scores

These are the mean FINE scores per query assigned by Evalutron graders. The FINE scores for the 5 candidates returned per algorithm, per query, have been averaged. Values are bounded between 0 and 100. A perfect score would be 100. Genre labels have been included for reference.

Genre Query DM6 DM7 GT3 JR2 NHHL1 NHHL2 PS1 RW4 SSKP1 SSKS2
BAROQUE d005709 44.7 45.1 77.5 77.4 53.3 53.3 60.0 77.8 44.9 50.7
BAROQUE d006218 9.9 9.9 27.0 31.2 34.3 34.3 54.8 31.2 42.2 34.6
BAROQUE d010595 69.0 72.0 64.0 72.0 64.0 64.0 72.5 69.0 69.0 76.5
BAROQUE d016827 21.9 21.4 30.5 16.4 11.6 11.6 19.7 29.5 25.2 24.0
BAROQUE d019925 76.1 77.4 82.0 82.5 83.1 83.1 86.3 85.8 85.9 85.0
BLUES e003462 13.1 13.1 25.8 21.3 24.9 24.9 22.6 24.0 25.2 19.6
BLUES e006719 55.0 56.0 76.0 63.0 80.0 80.0 74.0 75.0 69.5 74.5
BLUES e013942 55.5 52.0 69.0 64.0 57.0 57.0 71.0 72.0 77.0 73.0
BLUES e014478 37.3 40.0 9.8 24.0 30.9 30.9 31.4 19.4 21.1 23.3
BLUES e019782 62.7 59.2 74.8 74.4 82.6 82.6 88.0 75.8 87.9 76.0
CLASSICAL d006152 61.1 53.9 91.3 91.3 88.4 88.4 91.4 91.7 76.9 91.4
CLASSICAL d009811 12.0 12.0 21.8 14.1 3.4 3.4 22.7 31.0 17.3 26.7
CLASSICAL d015395 13.0 13.0 60.6 63.9 64.2 64.2 67.0 66.3 68.8 69.0
CLASSICAL d016084 33.0 33.0 69.0 64.5 50.0 50.0 67.5 72.0 59.5 71.5
CLASSICAL d018315 20.0 20.0 63.0 63.5 64.5 64.5 70.7 64.5 60.5 63.0
COUNTRY b003088 31.4 32.7 63.0 64.1 69.4 69.4 63.9 66.8 70.1 65.6
COUNTRY e008540 29.3 29.3 54.0 63.2 51.0 51.0 51.5 66.9 52.0 63.0
COUNTRY e012590 26.0 26.0 38.0 41.0 25.0 25.0 56.0 44.0 46.0 44.0
COUNTRY e014995 35.2 35.2 41.5 41.6 43.3 43.3 43.3 43.5 40.6 42.6
COUNTRY e016359 4.8 4.8 0.0 17.6 6.0 6.0 10.1 9.6 0.0 11.2
EDANCE b006191 8.3 8.3 11.9 12.5 11.3 11.3 19.1 13.9 32.4 37.9
EDANCE b011724 56.5 56.5 46.5 58.0 52.0 52.0 69.0 57.5 73.0 70.0
EDANCE b013180 48.2 48.2 39.7 40.5 37.9 37.9 59.7 48.4 59.6 52.2
EDANCE f010038 16.5 15.4 27.7 40.8 31.3 31.3 50.8 34.7 53.7 47.9
EDANCE f016289 6.0 5.2 15.9 3.4 14.1 14.1 15.7 10.7 35.4 37.7
JAZZ e002496 18.3 21.2 29.8 25.0 7.8 7.8 38.1 32.7 38.5 33.4
JAZZ e003502 74.0 74.0 50.0 55.0 70.0 70.0 78.0 71.0 89.0 88.0
JAZZ e011411 69.9 69.9 56.4 80.4 70.3 70.3 78.4 71.5 67.6 54.4
JAZZ e014617 26.5 29.5 22.0 17.1 68.9 68.9 88.0 59.1 83.5 78.7
JAZZ e019789 29.5 29.5 30.1 18.5 49.4 49.4 57.8 20.5 39.5 36.3
METAL b006857 50.5 50.5 54.5 64.5 55.7 55.7 49.4 64.2 65.5 61.4
METAL b009281 63.5 63.5 75.5 83.5 81.0 81.0 71.5 83.5 82.5 80.0
METAL b014284 41.0 44.5 35.5 46.0 60.0 60.0 67.5 46.0 65.5 69.0
METAL b014839 25.7 25.7 31.3 32.3 38.4 38.4 38.7 29.5 24.7 31.2
METAL b017570 16.4 12.6 19.5 17.2 13.2 13.2 14.1 14.4 21.9 26.5
RAPHIPHOP a002038 32.2 32.2 34.5 50.2 44.3 44.3 56.1 54.4 59.1 57.5
RAPHIPHOP a002900 25.4 25.4 37.0 29.7 28.2 28.2 39.7 39.7 28.8 40.0
RAPHIPHOP a007956 60.7 60.7 69.2 73.0 61.8 61.8 63.7 73.1 76.6 75.1
RAPHIPHOP a009690 51.5 51.5 67.5 45.5 58.0 58.0 58.5 49.0 61.5 68.0
RAPHIPHOP b004382 72.7 72.7 76.6 77.8 79.2 79.2 81.9 81.2 80.4 79.5
ROCKROLL b000859 25.7 31.5 45.7 37.0 55.5 55.5 34.6 43.0 21.7 41.2
ROCKROLL b008224 36.1 34.4 36.7 43.8 33.5 33.5 24.9 28.5 47.9 51.7
ROCKROLL b010359 5.8 5.8 19.7 13.1 10.8 10.8 15.6 10.4 22.9 18.5
ROCKROLL b010640 7.2 7.2 19.0 26.4 22.1 22.1 24.5 17.0 23.3 26.1
ROCKROLL b017313 11.5 11.5 17.3 17.0 9.0 9.0 24.0 21.5 22.0 19.0
ROMANTIC d000185 66.8 66.8 84.0 84.8 81.6 81.6 88.2 87.9 86.8 86.6
ROMANTIC d007856 70.8 75.8 56.6 77.9 77.7 77.7 84.8 81.8 74.7 76.5
ROMANTIC d011611 31.8 31.8 35.1 43.0 38.6 38.6 63.9 50.4 59.6 53.2
ROMANTIC d011697 7.3 7.3 28.0 33.5 24.0 24.0 27.2 33.7 31.2 22.6
ROMANTIC d012432 41.5 41.5 31.8 52.6 24.7 24.7 49.0 55.0 63.6 54.1
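
The per-query rows can be aggregated further, for example into per-genre means for each system. The sketch below parses the whitespace-delimited rows as they appear in the table; only the five BAROQUE rows are included to keep it short, and the function name `genre_means` is hypothetical.

```python
from collections import defaultdict
from typing import Dict

HEADER = "Genre Query DM6 DM7 GT3 JR2 NHHL1 NHHL2 PS1 RW4 SSKP1 SSKS2"
ROWS = """\
BAROQUE d005709 44.7 45.1 77.5 77.4 53.3 53.3 60.0 77.8 44.9 50.7
BAROQUE d006218 9.9 9.9 27.0 31.2 34.3 34.3 54.8 31.2 42.2 34.6
BAROQUE d010595 69.0 72.0 64.0 72.0 64.0 64.0 72.5 69.0 69.0 76.5
BAROQUE d016827 21.9 21.4 30.5 16.4 11.6 11.6 19.7 29.5 25.2 24.0
BAROQUE d019925 76.1 77.4 82.0 82.5 83.1 83.1 86.3 85.8 85.9 85.0
"""


def genre_means(header: str, rows: str) -> Dict[str, Dict[str, float]]:
    """Average each system's per-query scores within every genre."""
    systems = header.split()[2:]  # skip the Genre and Query columns
    sums = defaultdict(lambda: [0.0] * len(systems))
    counts = defaultdict(int)
    for line in rows.splitlines():
        fields = line.split()
        if not fields:
            continue
        genre = fields[0]
        for i, value in enumerate(fields[2:]):
            sums[genre][i] += float(value)
        counts[genre] += 1
    return {
        genre: dict(zip(systems, [total / counts[genre] for total in totals]))
        for genre, totals in sums.items()
    }


means = genre_means(HEADER, ROWS)
print(round(means["BAROQUE"]["DM6"], 2), round(means["BAROQUE"]["SSKS2"], 2))
```

Pasting all 50 rows into `ROWS` would give the means for every genre in the same way.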

BROAD Scores

These are the mean BROAD scores per query assigned by Evalutron graders. The BROAD scores for the 5 candidates returned per algorithm, per query, have been averaged. Values are bounded between 0 (not similar) and 2 (very similar). A perfect score would be 2. Genre labels have been included for reference.

Genre Query DM6 DM7 GT3 JR2 NHHL1 NHHL2 PS1 RW4 SSKP1 SSKS2
BAROQUE d005709 1.0 1.0 1.9 1.9 1.1 1.1 1.4 1.9 0.9 1.1
BAROQUE d006218 0.0 0.0 0.3 0.4 0.5 0.5 1.2 0.4 0.8 0.6
BAROQUE d010595 1.3 1.3 1.2 1.4 1.3 1.3 1.4 1.3 1.4 1.4
BAROQUE d016827 0.4 0.4 0.9 0.4 0.2 0.2 0.4 0.9 0.4 0.5
BAROQUE d019925 1.5 1.6 1.7 1.6 1.8 1.8 2.0 1.9 1.9 1.9
BLUES e003462 0.0 0.0 0.5 0.3 0.6 0.6 0.4 0.4 0.5 0.2
BLUES e006719 1.1 1.2 1.7 1.1 1.9 1.9 1.5 1.5 1.4 1.7
BLUES e013942 1.1 1.0 1.5 1.3 1.2 1.2 1.5 1.5 1.6 1.6
BLUES e014478 0.6 0.7 0.1 0.4 0.5 0.5 0.8 0.3 0.3 0.2
BLUES e019782 1.3 1.2 1.6 1.6 1.9 1.9 2.0 1.6 2.0 1.6
CLASSICAL d006152 1.4 1.2 2.0 2.0 2.0 2.0 2.0 2.0 1.8 2.0
CLASSICAL d009811 0.3 0.3 0.5 0.3 0.0 0.0 0.5 0.7 0.4 0.7
CLASSICAL d015395 0.2 0.2 1.4 1.7 1.4 1.4 1.6 1.6 1.7 1.7
CLASSICAL d016084 0.6 0.6 1.5 1.4 0.9 0.9 1.3 1.4 1.2 1.6
CLASSICAL d018315 0.0 0.0 1.1 1.0 1.0 1.0 1.2 1.0 1.0 1.1
COUNTRY b003088 0.4 0.4 1.4 1.5 1.6 1.6 1.3 1.6 1.5 1.3
COUNTRY e008540 0.5 0.5 1.0 1.3 1.2 1.2 1.1 1.4 1.2 1.4
COUNTRY e012590 0.4 0.4 0.7 0.8 0.3 0.3 1.3 0.9 0.9 0.9
COUNTRY e014995 0.7 0.7 1.0 0.9 1.0 1.0 1.0 1.0 1.0 1.0
COUNTRY e016359 0.0 0.0 0.0 0.3 0.1 0.1 0.1 0.1 0.0 0.2
EDANCE b006191 0.0 0.0 0.1 0.1 0.0 0.0 0.2 0.1 0.8 0.8
EDANCE b011724 1.1 1.1 0.9 1.2 1.0 1.0 1.5 1.2 1.6 1.5
EDANCE b013180 1.1 1.1 0.7 0.8 0.7 0.7 1.4 1.1 1.5 1.2
EDANCE f010038 0.1 0.1 0.3 0.6 0.5 0.5 1.0 0.5 1.1 0.8
EDANCE f016289 0.1 0.1 0.4 0.0 0.3 0.3 0.5 0.2 0.9 0.9
JAZZ e002496 0.4 0.5 0.8 0.7 0.0 0.0 0.8 0.9 0.9 0.8
JAZZ e003502 1.5 1.5 0.7 0.9 1.3 1.3 1.4 1.4 1.9 1.8
JAZZ e011411 1.3 1.3 0.8 1.8 1.3 1.3 1.7 1.6 1.1 0.7
JAZZ e014617 0.4 0.5 0.4 0.2 1.7 1.7 1.9 1.4 1.8 1.8
JAZZ e019789 0.7 0.7 0.5 0.2 1.1 1.1 1.1 0.2 0.8 0.7
METAL b006857 0.9 0.9 1.1 1.3 1.0 1.0 1.0 1.3 1.4 1.2
METAL b009281 1.4 1.4 1.8 2.0 1.8 1.8 1.6 2.0 2.0 2.0
METAL b014284 0.9 0.9 0.3 0.9 1.4 1.4 1.6 0.9 1.6 1.8
METAL b014839 0.3 0.3 0.3 0.5 0.7 0.7 0.6 0.3 0.3 0.5
METAL b017570 0.2 0.1 0.3 0.3 0.1 0.1 0.2 0.2 0.4 0.5
RAPHIPHOP a002038 0.5 0.5 0.7 1.0 1.0 1.0 1.4 1.3 1.5 1.5
RAPHIPHOP a002900 0.6 0.6 0.7 0.7 0.6 0.6 0.5 0.9 0.7 0.7
RAPHIPHOP a007956 1.4 1.4 1.6 1.7 1.4 1.4 1.5 1.6 1.8 1.9
RAPHIPHOP a009690 1.0 1.0 1.4 0.7 1.2 1.2 1.1 0.8 1.2 1.4
RAPHIPHOP b004382 1.6 1.6 2.0 2.0 2.0 2.0 2.0 2.0 1.9 1.9
ROCKROLL b000859 0.5 0.6 0.9 0.7 1.1 1.1 0.7 0.9 0.3 0.7
ROCKROLL b008224 0.5 0.4 0.5 0.7 0.4 0.4 0.2 0.3 0.9 1.0
ROCKROLL b010359 0.0 0.0 0.3 0.0 0.0 0.0 0.3 0.0 0.4 0.2
ROCKROLL b010640 0.1 0.1 0.4 0.5 0.4 0.4 0.8 0.4 0.7 0.7
ROCKROLL b017313 0.6 0.6 0.7 0.6 0.6 0.6 0.8 0.8 0.8 0.7
ROMANTIC d000185 1.4 1.4 1.7 1.9 1.6 1.6 2.0 2.0 2.0 2.0
ROMANTIC d007856 1.2 1.3 1.0 1.6 1.4 1.4 1.8 1.9 1.4 1.5
ROMANTIC d011611 0.5 0.5 0.5 0.9 0.6 0.6 1.4 1.1 1.2 1.1
ROMANTIC d011697 0.0 0.0 0.4 0.6 0.3 0.3 0.5 0.6 0.6 0.4
ROMANTIC d012432 0.9 0.9 0.5 1.1 0.3 0.3 0.9 1.1 1.5 1.2

Raw Scores

The raw data derived from the Evalutron 6000 human evaluations are located on the 2012:Audio Music Similarity and Retrieval Raw Data page.

Metadata and Distance Space Evaluation

The following reports provide evaluation statistics based on analysis of the distance space and metadata matches and include:

  • Neighbourhood clustering by artist, album and genre
  • Artist-filtered genre clustering
  • How often the triangular inequality holds
  • Statistics on 'hubs' (tracks similar to many tracks) and 'orphans' (tracks that are not similar to any other tracks at N results).
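
Two of the statistics listed above can be sketched on a toy distance matrix. This is an illustration under stated assumptions, not the code behind the reports: the function names and the 4x4 matrix are hypothetical. Hubs are read here as tracks that appear in many other tracks' top-N lists and orphans as tracks that appear in none, which is one plausible reading of the definitions above.

```python
from itertools import permutations
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def triangle_inequality_rate(d: Matrix) -> float:
    """Fraction of ordered triples (i, j, k) of distinct tracks for which
    d[i][k] <= d[i][j] + d[j][k] holds."""
    total = holds = 0
    for i, j, k in permutations(range(len(d)), 3):
        total += 1
        if d[i][k] <= d[i][j] + d[j][k]:
            holds += 1
    return holds / total


def neighbour_counts(d: Matrix, n_results: int) -> List[int]:
    """For each track, count how many other tracks rank it in their top n_results."""
    n = len(d)
    counts = [0] * n
    for q in range(n):
        others = sorted((i for i in range(n) if i != q), key=lambda i: d[q][i])
        for i in others[:n_results]:
            counts[i] += 1
    return counts


# Hypothetical symmetric distance matrix for 4 tracks.
D = [
    [0, 1, 5, 4],
    [1, 0, 2, 3],
    [5, 2, 0, 6],
    [4, 3, 6, 0],
]
counts = neighbour_counts(D, n_results=1)
orphans = [i for i, c in enumerate(counts) if c == 0]
print(triangle_inequality_rate(D), counts, orphans)
```

In the toy matrix, track 1 is the nearest neighbour of all three other tracks, while tracks 2 and 3 are nobody's nearest neighbour. The triple loop is cubic in the number of tracks, so for a 7000x7000 matrix one would sample triples rather than enumerate them.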

Reports

DM6 = Franz de Leon, Kirk Martinez
DM7 = Franz de Leon, Kirk Martinez
GT3 = George Tzanetakis
JR2 = Jia-Min Ren, Jyh-Shing Roger Jang
NHHL1 = Byeong-jun Han, Kyogu Lee, Juhan Nam, Jorge Herrera
NHHL2 = Byeong-jun Han, Kyogu Lee, Juhan Nam, Jorge Herrera
PS1 = Dominik Schnitzer, Tim Pohle
RW4 = Jia-Min Ren, Ming-Ju Wu, Jyh-Shing Roger Jang
SSKP1 = Klaus Seyerlehner, Markus Schedl, Peter Knees, Tim Pohle
SSKS2 = Klaus Seyerlehner, Markus Schedl, Peter Knees, Reinhard Sonnleitner