2006:Audio Music Similarity and Retrieval Results
Introduction
These are the results for the 2006 running of the Audio Music Similarity and Retrieval task set. For background information about this task set, please refer to the 2006:Audio Music Similarity and Retrieval page.
Each system was given 5000 songs chosen from the "uspop", "uscrap" and "cover song" collections, and each system returned a 5000x5000 distance matrix. 60 songs were randomly selected as queries, and the five most highly ranked songs out of the 5000 were extracted for each query (after filtering out the query itself, results from the same artist and members of the cover song collection). For each query, the returned results from all participants were then pooled and evaluated by human graders, each query/candidate pair being evaluated by 3 different graders using the Evalutron 6000 system. Graders were asked to provide two scores: one categorical score with three categories (NS, SS, VS, as explained below) and one fine score (in the range 0 to 10). An automated statistical evaluation based on a metadata catalog was also conducted; a description and analysis are provided below.
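To make the candidate-extraction step concrete, the following is a minimal MATLAB sketch (not the actual MIREX harness) of how the top five results for one query might be pulled from a system's distance matrix while applying the filters described above; the variable names dist, artist and isCover are assumptions.

```matlab
% Minimal sketch of the candidate-extraction step, under assumed inputs:
%   dist    - 5000x5000 distance matrix returned by a system
%   artist  - 5000x1 cell array of artist labels
%   isCover - 5000x1 logical vector flagging cover-song collection members
q = 42;                                        % index of one query song
[~, order] = sort(dist(q, :), 'ascend');       % rank all songs by distance
keep = (order ~= q) ...                        % filter out the query itself
     & ~strcmp(artist(order), artist{q}) ...   % filter same-artist results
     & ~isCover(order);                        % filter cover-song tracks
candidates = order(keep);
top5 = candidates(1:5);                        % the 5 most highly ranked songs
```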
Summary Data on Human Evaluations (Evalutron 6000)
Number of evaluators = 24
Number of evaluations per query/candidate pair = 3
Number of queries per grader = 7~8
Size of the candidate lists = Maximum 30 (with no overlap)
Number of randomly selected queries = 60
General Legend
Team ID
EP = Elias Pampalk
TP = Tim Pohle
VS = Vitor Soares
LR = Thomas Lidy and Andreas Rauber
KWT = Kris West (Trans)
KWL = Kris West (Likely)
Broad Categories
NS = Not Similar
SS = Somewhat Similar
VS = Very Similar
Calculating Summary Measures
Fine(1) = Sum of fine-grained human similarity decisions (0-10).
PSum(1) = Sum of human broad similarity decisions: NS=0, SS=1, VS=2.
WCsum(1) = 'World Cup' scoring: NS=0, SS=1, VS=3 (rewards Very Similar).
SDsum(1) = 'Stephen Downie' scoring: NS=0, SS=1, VS=4 (strongly rewards Very Similar).
Greater0(1) = NS=0, SS=1, VS=1 (binary relevance judgement).
Greater1(1) = NS=0, SS=0, VS=1 (binary relevance judgement using only Very Similar).
(1)Normalized to the range 0 to 1.
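As an illustration of how these measures can be computed, here is a minimal MATLAB sketch for one system; fineScores (a vector of fine grades in [0, 10]) and broadScores (a cell array of 'NS'/'SS'/'VS' grades over all of that system's graded query/candidate pairs) are hypothetical names, and dividing by the maximum possible score is one way to realise the stated 0-1 normalisation.

```matlab
% Hypothetical sketch of the summary measures; fineScores and broadScores
% are assumed inputs covering all of one system's graded pairs.
bc = categorical(broadScores, {'NS', 'SS', 'VS'});
Fine     = mean(fineScores) / 10;                      % normalised to [0,1]
PSum     = mean(1*(bc == 'SS') + 2*(bc == 'VS')) / 2;  % NS=0, SS=1, VS=2
WCsum    = mean(1*(bc == 'SS') + 3*(bc == 'VS')) / 3;  % NS=0, SS=1, VS=3
SDsum    = mean(1*(bc == 'SS') + 4*(bc == 'VS')) / 4;  % NS=0, SS=1, VS=4
Greater0 = mean(bc ~= 'NS');                           % SS or VS is relevant
Greater1 = mean(bc == 'VS');                           % only VS is relevant
```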
Overall Summary Results
Measure | EP | TP | VS | LR | KWT | KWL |
---|---|---|---|---|---|---|
Fine | 0.430 | 0.423 | 0.404 | 0.393 | 0.372 | 0.339 |
PSum | 0.425 | 0.411 | 0.388 | 0.374 | 0.349 | 0.313 |
WCsum | 0.358 | 0.340 | 0.323 | 0.306 | 0.280 | 0.248 |
SDsum | 0.324 | 0.305 | 0.290 | 0.271 | 0.246 | 0.216 |
Greater0 | 0.627 | 0.623 | 0.586 | 0.579 | 0.557 | 0.509 |
Greater1 | 0.223 | 0.199 | 0.191 | 0.169 | 0.142 | 0.118 |
http://staff.aist.go.jp/elias.pampalk/papers/mirex06/friedman.png
This figure shows the official ranking of the submissions computed using a Friedman test. The blue lines indicate significance boundaries at the p=0.05 level. As can be seen, the differences are not significant. For a more detailed description and discussion see http://staff.aist.go.jp/elias.pampalk/papers/pam_mirex06.pdf.
Audio Music Similarity and Retrieval Runtime Data
Team ID | Stage | Machine | Run-time (seconds) |
---|---|---|---|
EP | feature | beer 6 | 5889 |
EP | distance | beer 6 | 6066 |
KWT | feature | beer 6 | 29899 |
KWT | distance | beer 6 | 25352 |
KWL | both | beer 4 | 47698 |
LR | feature | beer 4 | 13794 |
LR | distance | beer 4 | 131 |
TP | feature | beer 8 | 14333 |
TP | distance | beer 8 | 3337 |
For a description of the computers the submissions ran on, see 2006:MIREX_2006_Equipment.
Friedman Test with Multiple Comparisons Results (p=0.05)
The Friedman test was run in MATLAB against the Fine summary data over the 60 queries.
Command: [c,m,h,gnames] = multcompare(stats, 'ctype', 'tukey-kramer','estimate', 'friedman', 'alpha', 0.05);
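For context, here is a sketch of how the whole test might have been run; FINE, a 60x6 matrix of per-query Fine scores (rows = queries, columns = systems), is an assumed name, and the multcompare call is the command quoted above.

```matlab
% Hypothetical sketch of the full significance test.
[p, tbl, stats] = friedman(FINE, 1, 'off');    % Friedman's ANOVA over queries
[c, m, h, gnames] = multcompare(stats, 'ctype', 'tukey-kramer', ...
                                'estimate', 'friedman', 'alpha', 0.05);
% A row of c whose confidence interval excludes zero marks a pair of systems
% whose difference is significant at the 0.05 level.
```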
Friedman's ANOVA Table
Source | SS | df | MS | Chi-sq | Prob>Chi-sq |
---|---|---|---|---|---|
Columns | 84.7333 | 5 | 16.9467 | 24.2905 | 0.00019091 |
Error | 961.7667 | 295 | 3.2602 | | |
Total | 1046.5 | 359 | | | |
Team ID | Team ID | Lower bound | Mean | Upper bound | Significance |
---|---|---|---|---|---|
EP | TP | -0.963 | 0.008 | 0.980 | FALSE |
EP | VS | -0.755 | 0.217 | 1.188 | FALSE |
EP | LR | -0.630 | 0.342 | 1.313 | FALSE |
EP | KWT | -0.030 | 0.942 | 1.913 | FALSE |
EP | KWL | 0.320 | 1.292 | 2.263 | TRUE |
TP | VS | -0.763 | 0.208 | 1.180 | FALSE |
TP | LR | -0.638 | 0.333 | 1.305 | FALSE |
TP | KWT | -0.038 | 0.933 | 1.905 | FALSE |
TP | KWL | 0.312 | 1.283 | 2.255 | TRUE |
VS | LR | -0.847 | 0.125 | 1.097 | FALSE |
VS | KWT | -0.247 | 0.725 | 1.697 | FALSE |
VS | KWL | 0.103 | 1.075 | 2.047 | TRUE |
LR | KWT | -0.372 | 0.600 | 1.572 | FALSE |
LR | KWL | -0.022 | 0.950 | 1.922 | FALSE |
KWT | KWL | -0.622 | 0.350 | 1.322 | FALSE |
Summary Results by Query
Fine
queryID | EP | TP | VS | LR | KWT | KWL |
---|---|---|---|---|---|---|
a001528 | 0.428 | 0.318 | 0.387 | 0.459 | 0.354 | 0.331 |
a004667 | 0.429 | 0.503 | 0.579 | 0.529 | 0.383 | 0.429 |
a000518 | 0.107 | 0.217 | 0.221 | 0.213 | 0.145 | 0.206 |
a002693 | 0.657 | 0.495 | 0.337 | 0.519 | 0.345 | 0.483 |
a004830 | 0.338 | 0.348 | 0.345 | 0.311 | 0.413 | 0.354 |
a002784 | 0.371 | 0.432 | 0.280 | 0.347 | 0.281 | 0.282 |
a005705 | 0.590 | 0.739 | 0.500 | 0.389 | 0.337 | 0.247 |
a006272 | 0.258 | 0.233 | 0.219 | 0.149 | 0.247 | 0.200 |
a007005 | 0.188 | 0.078 | 0.166 | 0.190 | 0.183 | 0.085 |
a008401 | 0.365 | 0.242 | 0.319 | 0.176 | 0.514 | 0.247 |
a008850 | 0.111 | 0.464 | 0.211 | 0.211 | 0.123 | 0.113 |
a007054 | 0.230 | 0.315 | 0.354 | 0.295 | 0.295 | 0.343 |
a008365 | 0.260 | 0.344 | 0.337 | 0.280 | 0.374 | 0.145 |
b000990 | 0.477 | 0.401 | 0.403 | 0.366 | 0.487 | 0.335 |
b001799 | 0.301 | 0.437 | 0.277 | 0.464 | 0.327 | 0.303 |
b001516 | 0.367 | 0.579 | 0.471 | 0.358 | 0.329 | 0.358 |
b002576 | 0.138 | 0.228 | 0.167 | 0.213 | 0.445 | 0.167 |
b004483 | 0.479 | 0.434 | 0.266 | 0.285 | 0.477 | 0.264 |
b006517 | 0.599 | 0.739 | 0.149 | 0.709 | 0.615 | 0.176 |
b005395 | 0.374 | 0.402 | 0.342 | 0.511 | 0.503 | 0.319 |
b007493 | 0.809 | 0.677 | 0.736 | 0.429 | 0.577 | 0.785 |
b005447 | 0.447 | 0.342 | 0.553 | 0.254 | 0.426 | 0.419 |
b009401 | 0.639 | 0.513 | 0.695 | 0.612 | 0.557 | 0.464 |
b006979 | 0.228 | 0.219 | 0.442 | 0.314 | 0.158 | 0.224 |
b012801 | 0.316 | 0.215 | 0.278 | 0.443 | 0.419 | 0.527 |
b008611 | 0.337 | 0.329 | 0.277 | 0.299 | 0.247 | 0.227 |
b013992 | 0.266 | 0.309 | 0.334 | 0.331 | 0.219 | 0.207 |
b015082 | 0.497 | 0.530 | 0.494 | 0.527 | 0.280 | 0.500 |
b015991 | 0.818 | 0.617 | 0.784 | 0.787 | 0.305 | 0.385 |
b009364 | 0.393 | 0.412 | 0.535 | 0.443 | 0.489 | 0.384 |
a007915 | 0.656 | 0.500 | 0.690 | 0.373 | 0.248 | 0.397 |
a002856 | 0.309 | 0.164 | 0.173 | 0.167 | 0.420 | 0.307 |
a000751 | 0.653 | 0.475 | 0.353 | 0.305 | 0.171 | 0.244 |
a002907 | 0.487 | 0.509 | 0.376 | 0.156 | 0.269 | 0.315 |
a000193 | 0.344 | 0.425 | 0.296 | 0.305 | 0.199 | 0.321 |
b006599 | 0.353 | 0.300 | 0.363 | 0.483 | 0.201 | 0.245 |
b010953 | 0.631 | 0.564 | 0.747 | 0.731 | 0.497 | 0.502 |
a003397 | 0.325 | 0.361 | 0.223 | 0.315 | 0.161 | 0.167 |
a006525 | 0.448 | 0.421 | 0.192 | 0.231 | 0.397 | 0.155 |
b012279 | 0.592 | 0.306 | 0.564 | 0.305 | 0.537 | 0.565 |
a004526 | 0.497 | 0.401 | 0.421 | 0.461 | 0.590 | 0.485 |
b010504 | 0.727 | 0.594 | 0.631 | 0.615 | 0.377 | 0.428 |
b017426 | 0.393 | 0.423 | 0.396 | 0.424 | 0.449 | 0.467 |
b011185 | 0.611 | 0.455 | 0.524 | 0.547 | 0.353 | 0.465 |
b011453 | 0.475 | 0.317 | 0.292 | 0.410 | 0.500 | 0.535 |
b006618 | 0.480 | 0.534 | 0.584 | 0.499 | 0.526 | 0.559 |
b017223 | 0.059 | 0.202 | 0.179 | 0.202 | 0.452 | 0.467 |
a001530 | 0.615 | 0.593 | 0.337 | 0.625 | 0.550 | 0.312 |
b019063 | 0.445 | 0.403 | 0.383 | 0.433 | 0.398 | 0.162 |
b005063 | 0.587 | 0.711 | 0.334 | 0.235 | 0.507 | 0.271 |
a004035 | 0.495 | 0.557 | 0.530 | 0.516 | 0.299 | 0.292 |
a003713 | 0.425 | 0.409 | 0.259 | 0.321 | 0.467 | 0.427 |
b015200 | 0.198 | 0.236 | 0.109 | 0.198 | 0.140 | 0.174 |
a004755 | 0.556 | 0.584 | 0.699 | 0.493 | 0.481 | 0.437 |
b019276 | 0.310 | 0.447 | 0.360 | 0.393 | 0.210 | 0.347 |
b018901 | 0.711 | 0.743 | 0.796 | 0.717 | 0.602 | 0.446 |
b005570 | 0.334 | 0.363 | 0.424 | 0.331 | 0.268 | 0.298 |
b006144 | 0.513 | 0.565 | 0.671 | 0.600 | 0.537 | 0.421 |
b002169 | 0.274 | 0.426 | 0.423 | 0.344 | 0.307 | 0.386 |
b016133 | 0.487 | 0.305 | 0.441 | 0.415 | 0.330 | 0.226 |
Ave. Fine Score: | 0.430 | 0.423 | 0.404 | 0.393 | 0.372 | 0.339 |
PSum
queryID | EP | TP | VS | LR | KWT | KWL |
---|---|---|---|---|---|---|
a001528 | 0.400 | 0.300 | 0.333 | 0.433 | 0.300 | 0.233 |
a004667 | 0.367 | 0.400 | 0.467 | 0.467 | 0.300 | 0.333 |
a000518 | 0.033 | 0.167 | 0.200 | 0.267 | 0.133 | 0.267 |
a002693 | 0.700 | 0.467 | 0.300 | 0.500 | 0.367 | 0.467 |
a004830 | 0.300 | 0.267 | 0.233 | 0.267 | 0.400 | 0.300 |
a002784 | 0.433 | 0.467 | 0.333 | 0.433 | 0.300 | 0.267 |
a005705 | 0.633 | 0.800 | 0.500 | 0.367 | 0.267 | 0.233 |
a006272 | 0.167 | 0.133 | 0.100 | 0.000 | 0.067 | 0.100 |
a007005 | 0.167 | 0.067 | 0.167 | 0.167 | 0.167 | 0.133 |
a008401 | 0.267 | 0.133 | 0.233 | 0.067 | 0.567 | 0.167 |
a008850 | 0.033 | 0.433 | 0.200 | 0.133 | 0.067 | 0.067 |
a007054 | 0.200 | 0.267 | 0.333 | 0.267 | 0.233 | 0.400 |
a008365 | 0.167 | 0.300 | 0.300 | 0.233 | 0.367 | 0.067 |
b000990 | 0.567 | 0.433 | 0.400 | 0.367 | 0.533 | 0.333 |
b001799 | 0.367 | 0.500 | 0.267 | 0.500 | 0.367 | 0.367 |
b001516 | 0.367 | 0.633 | 0.467 | 0.300 | 0.333 | 0.300 |
b002576 | 0.100 | 0.233 | 0.200 | 0.233 | 0.433 | 0.133 |
b004483 | 0.533 | 0.433 | 0.200 | 0.233 | 0.533 | 0.200 |
b006517 | 0.633 | 0.833 | 0.067 | 0.767 | 0.667 | 0.133 |
b005395 | 0.467 | 0.500 | 0.433 | 0.633 | 0.600 | 0.400 |
b007493 | 0.900 | 0.733 | 0.867 | 0.533 | 0.567 | 0.900 |
b005447 | 0.433 | 0.400 | 0.667 | 0.167 | 0.467 | 0.433 |
b009401 | 0.733 | 0.533 | 0.733 | 0.700 | 0.667 | 0.567 |
b006979 | 0.200 | 0.200 | 0.433 | 0.300 | 0.133 | 0.200 |
b012801 | 0.267 | 0.100 | 0.267 | 0.400 | 0.433 | 0.567 |
b008611 | 0.267 | 0.300 | 0.233 | 0.200 | 0.200 | 0.133 |
b013992 | 0.300 | 0.267 | 0.267 | 0.233 | 0.200 | 0.067 |
b015082 | 0.533 | 0.633 | 0.567 | 0.567 | 0.267 | 0.633 |
b015991 | 0.967 | 0.733 | 0.900 | 0.967 | 0.300 | 0.333 |
b009364 | 0.333 | 0.300 | 0.500 | 0.367 | 0.433 | 0.267 |
a007915 | 0.733 | 0.567 | 0.800 | 0.300 | 0.267 | 0.367 |
a002856 | 0.300 | 0.067 | 0.067 | 0.133 | 0.467 | 0.333 |
a000751 | 0.733 | 0.433 | 0.267 | 0.200 | 0.067 | 0.100 |
a002907 | 0.367 | 0.500 | 0.367 | 0.100 | 0.200 | 0.300 |
a000193 | 0.367 | 0.433 | 0.267 | 0.233 | 0.100 | 0.233 |
b006599 | 0.167 | 0.133 | 0.167 | 0.300 | 0.100 | 0.167 |
b010953 | 0.767 | 0.600 | 0.800 | 0.900 | 0.500 | 0.633 |
a003397 | 0.300 | 0.400 | 0.200 | 0.267 | 0.133 | 0.100 |
a006525 | 0.467 | 0.400 | 0.167 | 0.200 | 0.400 | 0.067 |
b012279 | 0.600 | 0.233 | 0.600 | 0.233 | 0.533 | 0.567 |
a004526 | 0.433 | 0.333 | 0.333 | 0.433 | 0.567 | 0.500 |
b010504 | 0.767 | 0.633 | 0.700 | 0.667 | 0.333 | 0.367 |
b017426 | 0.400 | 0.533 | 0.433 | 0.467 | 0.467 | 0.533 |
b011185 | 0.533 | 0.333 | 0.467 | 0.500 | 0.233 | 0.433 |
b011453 | 0.433 | 0.233 | 0.200 | 0.367 | 0.433 | 0.433 |
b006618 | 0.533 | 0.567 | 0.633 | 0.533 | 0.500 | 0.600 |
b017223 | 0.033 | 0.100 | 0.133 | 0.167 | 0.500 | 0.500 |
a001530 | 0.733 | 0.633 | 0.367 | 0.667 | 0.600 | 0.333 |
b019063 | 0.500 | 0.433 | 0.367 | 0.400 | 0.367 | 0.100 |
b005063 | 0.567 | 0.700 | 0.367 | 0.167 | 0.533 | 0.233 |
a004035 | 0.500 | 0.567 | 0.633 | 0.533 | 0.167 | 0.233 |
a003713 | 0.400 | 0.333 | 0.067 | 0.267 | 0.433 | 0.367 |
b015200 | 0.167 | 0.267 | 0.067 | 0.200 | 0.100 | 0.200 |
a004755 | 0.467 | 0.533 | 0.600 | 0.367 | 0.400 | 0.333 |
b019276 | 0.300 | 0.500 | 0.333 | 0.467 | 0.133 | 0.333 |
b018901 | 0.700 | 0.667 | 0.800 | 0.700 | 0.600 | 0.367 |
b005570 | 0.333 | 0.367 | 0.367 | 0.333 | 0.200 | 0.200 |
b006144 | 0.433 | 0.533 | 0.667 | 0.600 | 0.433 | 0.333 |
b002169 | 0.167 | 0.400 | 0.467 | 0.333 | 0.233 | 0.367 |
b016133 | 0.467 | 0.267 | 0.433 | 0.333 | 0.300 | 0.167 |
Ave. PSum Score: | 0.425 | 0.411 | 0.388 | 0.374 | 0.349 | 0.313 |
WCsum
queryID | EP | TP | VS | LR | KWT | KWL |
---|---|---|---|---|---|---|
a001528 | 0.289 | 0.200 | 0.244 | 0.333 | 0.222 | 0.178 |
a004667 | 0.289 | 0.289 | 0.356 | 0.356 | 0.200 | 0.244 |
a000518 | 0.022 | 0.111 | 0.133 | 0.200 | 0.089 | 0.200 |
a002693 | 0.622 | 0.333 | 0.200 | 0.400 | 0.289 | 0.400 |
a004830 | 0.200 | 0.178 | 0.156 | 0.178 | 0.289 | 0.200 |
a002784 | 0.356 | 0.378 | 0.267 | 0.356 | 0.222 | 0.200 |
a005705 | 0.578 | 0.733 | 0.422 | 0.267 | 0.178 | 0.156 |
a006272 | 0.111 | 0.089 | 0.067 | 0.000 | 0.044 | 0.067 |
a007005 | 0.111 | 0.044 | 0.111 | 0.133 | 0.111 | 0.089 |
a008401 | 0.200 | 0.133 | 0.200 | 0.044 | 0.511 | 0.111 |
a008850 | 0.022 | 0.333 | 0.133 | 0.089 | 0.044 | 0.044 |
a007054 | 0.156 | 0.222 | 0.244 | 0.200 | 0.156 | 0.311 |
a008365 | 0.111 | 0.200 | 0.200 | 0.178 | 0.289 | 0.044 |
b000990 | 0.533 | 0.356 | 0.311 | 0.289 | 0.467 | 0.244 |
b001799 | 0.267 | 0.400 | 0.178 | 0.378 | 0.289 | 0.244 |
b001516 | 0.311 | 0.600 | 0.422 | 0.267 | 0.311 | 0.267 |
b002576 | 0.067 | 0.156 | 0.133 | 0.156 | 0.356 | 0.111 |
b004483 | 0.422 | 0.311 | 0.156 | 0.200 | 0.444 | 0.133 |
b006517 | 0.556 | 0.800 | 0.044 | 0.711 | 0.622 | 0.111 |
b005395 | 0.378 | 0.422 | 0.356 | 0.556 | 0.533 | 0.311 |
b007493 | 0.867 | 0.711 | 0.822 | 0.467 | 0.511 | 0.867 |
b005447 | 0.378 | 0.356 | 0.600 | 0.133 | 0.400 | 0.378 |
b009401 | 0.644 | 0.400 | 0.644 | 0.600 | 0.556 | 0.489 |
b006979 | 0.156 | 0.156 | 0.378 | 0.267 | 0.111 | 0.178 |
b012801 | 0.200 | 0.067 | 0.200 | 0.311 | 0.289 | 0.467 |
b008611 | 0.200 | 0.222 | 0.200 | 0.133 | 0.133 | 0.089 |
b013992 | 0.200 | 0.178 | 0.178 | 0.156 | 0.133 | 0.044 |
b015082 | 0.489 | 0.556 | 0.467 | 0.467 | 0.222 | 0.578 |
b015991 | 0.956 | 0.667 | 0.867 | 0.956 | 0.267 | 0.267 |
b009364 | 0.244 | 0.200 | 0.400 | 0.311 | 0.356 | 0.178 |
a007915 | 0.667 | 0.489 | 0.756 | 0.244 | 0.200 | 0.311 |
a002856 | 0.222 | 0.044 | 0.067 | 0.089 | 0.356 | 0.244 |
a000751 | 0.689 | 0.422 | 0.200 | 0.133 | 0.044 | 0.067 |
a002907 | 0.311 | 0.444 | 0.311 | 0.067 | 0.133 | 0.222 |
a000193 | 0.311 | 0.378 | 0.222 | 0.178 | 0.067 | 0.178 |
b006599 | 0.111 | 0.089 | 0.111 | 0.244 | 0.067 | 0.111 |
b010953 | 0.711 | 0.511 | 0.756 | 0.867 | 0.422 | 0.578 |
a003397 | 0.222 | 0.333 | 0.156 | 0.222 | 0.111 | 0.067 |
a006525 | 0.444 | 0.378 | 0.133 | 0.200 | 0.378 | 0.044 |
b012279 | 0.533 | 0.200 | 0.556 | 0.200 | 0.467 | 0.489 |
a004526 | 0.356 | 0.244 | 0.289 | 0.378 | 0.467 | 0.378 |
b010504 | 0.689 | 0.533 | 0.644 | 0.622 | 0.289 | 0.289 |
b017426 | 0.333 | 0.444 | 0.311 | 0.378 | 0.356 | 0.444 |
b011185 | 0.422 | 0.267 | 0.333 | 0.378 | 0.178 | 0.311 |
b011453 | 0.356 | 0.178 | 0.133 | 0.289 | 0.311 | 0.356 |
b006618 | 0.422 | 0.511 | 0.556 | 0.444 | 0.378 | 0.511 |
b017223 | 0.022 | 0.067 | 0.111 | 0.111 | 0.422 | 0.422 |
a001530 | 0.644 | 0.511 | 0.311 | 0.556 | 0.489 | 0.244 |
b019063 | 0.422 | 0.356 | 0.289 | 0.311 | 0.311 | 0.067 |
b005063 | 0.511 | 0.622 | 0.267 | 0.111 | 0.400 | 0.178 |
a004035 | 0.422 | 0.489 | 0.556 | 0.422 | 0.111 | 0.178 |
a003713 | 0.356 | 0.267 | 0.044 | 0.200 | 0.356 | 0.267 |
b015200 | 0.111 | 0.200 | 0.044 | 0.133 | 0.067 | 0.133 |
a004755 | 0.378 | 0.444 | 0.489 | 0.244 | 0.311 | 0.267 |
b019276 | 0.222 | 0.378 | 0.267 | 0.378 | 0.089 | 0.222 |
b018901 | 0.600 | 0.556 | 0.733 | 0.600 | 0.533 | 0.244 |
b005570 | 0.244 | 0.289 | 0.289 | 0.222 | 0.133 | 0.133 |
b006144 | 0.333 | 0.422 | 0.578 | 0.533 | 0.356 | 0.289 |
b002169 | 0.111 | 0.378 | 0.378 | 0.244 | 0.156 | 0.311 |
b016133 | 0.356 | 0.178 | 0.378 | 0.244 | 0.222 | 0.133 |
Ave. WCsum Score: | 0.358 | 0.340 | 0.323 | 0.306 | 0.280 | 0.248 |
SDsum
queryID | EP | TP | VS | LR | KWT | KWL |
---|---|---|---|---|---|---|
a001528 | 0.233 | 0.150 | 0.200 | 0.283 | 0.183 | 0.150 |
a004667 | 0.250 | 0.233 | 0.300 | 0.300 | 0.150 | 0.200 |
a000518 | 0.017 | 0.083 | 0.100 | 0.167 | 0.067 | 0.167 |
a002693 | 0.583 | 0.267 | 0.150 | 0.350 | 0.250 | 0.367 |
a004830 | 0.150 | 0.133 | 0.117 | 0.133 | 0.233 | 0.150 |
a002784 | 0.317 | 0.333 | 0.233 | 0.317 | 0.183 | 0.167 |
a005705 | 0.550 | 0.700 | 0.383 | 0.217 | 0.133 | 0.117 |
a006272 | 0.083 | 0.067 | 0.050 | 0.000 | 0.033 | 0.050 |
a007005 | 0.083 | 0.033 | 0.083 | 0.117 | 0.083 | 0.067 |
a008401 | 0.167 | 0.133 | 0.183 | 0.033 | 0.483 | 0.083 |
a008850 | 0.017 | 0.283 | 0.100 | 0.067 | 0.033 | 0.033 |
a007054 | 0.133 | 0.200 | 0.200 | 0.167 | 0.117 | 0.267 |
a008365 | 0.083 | 0.150 | 0.150 | 0.150 | 0.250 | 0.033 |
b000990 | 0.517 | 0.317 | 0.267 | 0.250 | 0.433 | 0.200 |
b001799 | 0.217 | 0.350 | 0.133 | 0.317 | 0.250 | 0.183 |
b001516 | 0.283 | 0.583 | 0.400 | 0.250 | 0.300 | 0.250 |
b002576 | 0.050 | 0.117 | 0.100 | 0.117 | 0.317 | 0.100 |
b004483 | 0.367 | 0.250 | 0.133 | 0.183 | 0.400 | 0.100 |
b006517 | 0.517 | 0.783 | 0.033 | 0.683 | 0.600 | 0.100 |
b005395 | 0.333 | 0.383 | 0.317 | 0.517 | 0.500 | 0.267 |
b007493 | 0.850 | 0.700 | 0.800 | 0.433 | 0.483 | 0.850 |
b005447 | 0.350 | 0.333 | 0.567 | 0.117 | 0.367 | 0.350 |
b009401 | 0.600 | 0.333 | 0.600 | 0.550 | 0.500 | 0.450 |
b006979 | 0.133 | 0.133 | 0.350 | 0.250 | 0.100 | 0.167 |
b012801 | 0.167 | 0.050 | 0.167 | 0.267 | 0.217 | 0.417 |
b008611 | 0.167 | 0.183 | 0.183 | 0.100 | 0.100 | 0.067 |
b013992 | 0.150 | 0.133 | 0.133 | 0.117 | 0.100 | 0.033 |
b015082 | 0.467 | 0.517 | 0.417 | 0.417 | 0.200 | 0.550 |
b015991 | 0.950 | 0.633 | 0.850 | 0.950 | 0.250 | 0.233 |
b009364 | 0.200 | 0.150 | 0.350 | 0.283 | 0.317 | 0.133 |
a007915 | 0.633 | 0.450 | 0.733 | 0.217 | 0.167 | 0.283 |
a002856 | 0.183 | 0.033 | 0.067 | 0.067 | 0.300 | 0.200 |
a000751 | 0.667 | 0.417 | 0.167 | 0.100 | 0.033 | 0.050 |
a002907 | 0.283 | 0.417 | 0.283 | 0.050 | 0.100 | 0.183 |
a000193 | 0.283 | 0.350 | 0.200 | 0.150 | 0.050 | 0.150 |
b006599 | 0.083 | 0.067 | 0.083 | 0.217 | 0.050 | 0.083 |
b010953 | 0.683 | 0.467 | 0.733 | 0.850 | 0.383 | 0.550 |
a003397 | 0.183 | 0.300 | 0.133 | 0.200 | 0.100 | 0.050 |
a006525 | 0.433 | 0.367 | 0.117 | 0.200 | 0.367 | 0.033 |
b012279 | 0.500 | 0.183 | 0.533 | 0.183 | 0.433 | 0.450 |
a004526 | 0.317 | 0.200 | 0.267 | 0.350 | 0.417 | 0.317 |
b010504 | 0.650 | 0.483 | 0.617 | 0.600 | 0.267 | 0.250 |
b017426 | 0.300 | 0.400 | 0.250 | 0.333 | 0.300 | 0.400 |
b011185 | 0.367 | 0.233 | 0.267 | 0.317 | 0.150 | 0.250 |
b011453 | 0.317 | 0.150 | 0.100 | 0.250 | 0.250 | 0.317 |
b006618 | 0.367 | 0.483 | 0.517 | 0.400 | 0.317 | 0.467 |
b017223 | 0.017 | 0.050 | 0.100 | 0.083 | 0.383 | 0.383 |
a001530 | 0.600 | 0.450 | 0.283 | 0.500 | 0.433 | 0.200 |
b019063 | 0.383 | 0.317 | 0.250 | 0.267 | 0.283 | 0.050 |
b005063 | 0.483 | 0.583 | 0.217 | 0.083 | 0.333 | 0.150 |
a004035 | 0.383 | 0.450 | 0.517 | 0.367 | 0.083 | 0.150 |
a003713 | 0.333 | 0.233 | 0.033 | 0.167 | 0.317 | 0.217 |
b015200 | 0.083 | 0.167 | 0.033 | 0.100 | 0.050 | 0.100 |
a004755 | 0.333 | 0.400 | 0.433 | 0.183 | 0.267 | 0.233 |
b019276 | 0.183 | 0.317 | 0.233 | 0.333 | 0.067 | 0.167 |
b018901 | 0.550 | 0.500 | 0.700 | 0.550 | 0.500 | 0.183 |
b005570 | 0.200 | 0.250 | 0.250 | 0.167 | 0.100 | 0.100 |
b006144 | 0.283 | 0.367 | 0.533 | 0.500 | 0.317 | 0.267 |
b002169 | 0.083 | 0.367 | 0.333 | 0.200 | 0.117 | 0.283 |
b016133 | 0.300 | 0.133 | 0.350 | 0.200 | 0.183 | 0.117 |
Ave. SDsum Score: | 0.324 | 0.305 | 0.290 | 0.271 | 0.246 | 0.216 |
Greater0
queryID | EP | TP | VS | LR | KWT | KWL |
---|---|---|---|---|---|---|
a001528 | 0.733 | 0.600 | 0.600 | 0.733 | 0.533 | 0.400 |
a004667 | 0.600 | 0.733 | 0.800 | 0.800 | 0.600 | 0.600 |
a000518 | 0.067 | 0.333 | 0.400 | 0.467 | 0.267 | 0.467 |
a002693 | 0.933 | 0.867 | 0.600 | 0.800 | 0.600 | 0.667 |
a004830 | 0.600 | 0.533 | 0.467 | 0.533 | 0.733 | 0.600 |
a002784 | 0.667 | 0.733 | 0.533 | 0.667 | 0.533 | 0.467 |
a005705 | 0.800 | 1.000 | 0.733 | 0.667 | 0.533 | 0.467 |
a006272 | 0.333 | 0.267 | 0.200 | 0.000 | 0.133 | 0.200 |
a007005 | 0.333 | 0.133 | 0.333 | 0.267 | 0.333 | 0.267 |
a008401 | 0.467 | 0.133 | 0.333 | 0.133 | 0.733 | 0.333 |
a008850 | 0.067 | 0.733 | 0.400 | 0.267 | 0.133 | 0.133 |
a007054 | 0.333 | 0.400 | 0.600 | 0.467 | 0.467 | 0.667 |
a008365 | 0.333 | 0.600 | 0.600 | 0.400 | 0.600 | 0.133 |
b000990 | 0.667 | 0.667 | 0.667 | 0.600 | 0.733 | 0.600 |
b001799 | 0.667 | 0.800 | 0.533 | 0.867 | 0.600 | 0.733 |
b001516 | 0.533 | 0.733 | 0.600 | 0.400 | 0.400 | 0.400 |
b002576 | 0.200 | 0.467 | 0.400 | 0.467 | 0.667 | 0.200 |
b004483 | 0.867 | 0.800 | 0.333 | 0.333 | 0.800 | 0.400 |
b006517 | 0.867 | 0.933 | 0.133 | 0.933 | 0.800 | 0.200 |
b005395 | 0.733 | 0.733 | 0.667 | 0.867 | 0.800 | 0.667 |
b007493 | 1.000 | 0.800 | 1.000 | 0.733 | 0.733 | 1.000 |
b005447 | 0.600 | 0.533 | 0.867 | 0.267 | 0.667 | 0.600 |
b009401 | 1.000 | 0.933 | 1.000 | 1.000 | 1.000 | 0.800 |
b006979 | 0.333 | 0.333 | 0.600 | 0.400 | 0.200 | 0.267 |
b012801 | 0.467 | 0.200 | 0.467 | 0.667 | 0.867 | 0.867 |
b008611 | 0.467 | 0.533 | 0.333 | 0.400 | 0.400 | 0.267 |
b013992 | 0.600 | 0.533 | 0.533 | 0.467 | 0.400 | 0.133 |
b015082 | 0.667 | 0.867 | 0.867 | 0.867 | 0.400 | 0.800 |
b015991 | 1.000 | 0.933 | 1.000 | 1.000 | 0.400 | 0.533 |
b009364 | 0.600 | 0.600 | 0.800 | 0.533 | 0.667 | 0.533 |
a007915 | 0.933 | 0.800 | 0.933 | 0.467 | 0.467 | 0.533 |
a002856 | 0.533 | 0.133 | 0.067 | 0.267 | 0.800 | 0.600 |
a000751 | 0.867 | 0.467 | 0.467 | 0.400 | 0.133 | 0.200 |
a002907 | 0.533 | 0.667 | 0.533 | 0.200 | 0.400 | 0.533 |
a000193 | 0.533 | 0.600 | 0.400 | 0.400 | 0.200 | 0.400 |
b006599 | 0.333 | 0.267 | 0.333 | 0.467 | 0.200 | 0.333 |
b010953 | 0.933 | 0.867 | 0.933 | 1.000 | 0.733 | 0.800 |
a003397 | 0.533 | 0.600 | 0.333 | 0.400 | 0.200 | 0.200 |
a006525 | 0.533 | 0.467 | 0.267 | 0.200 | 0.467 | 0.133 |
b012279 | 0.800 | 0.333 | 0.733 | 0.333 | 0.733 | 0.800 |
a004526 | 0.667 | 0.600 | 0.467 | 0.600 | 0.867 | 0.867 |
b010504 | 1.000 | 0.933 | 0.867 | 0.800 | 0.467 | 0.600 |
b017426 | 0.600 | 0.800 | 0.800 | 0.733 | 0.800 | 0.800 |
b011185 | 0.867 | 0.533 | 0.867 | 0.867 | 0.400 | 0.800 |
b011453 | 0.667 | 0.400 | 0.400 | 0.600 | 0.800 | 0.667 |
b006618 | 0.867 | 0.733 | 0.867 | 0.800 | 0.867 | 0.867 |
b017223 | 0.067 | 0.200 | 0.200 | 0.333 | 0.733 | 0.733 |
a001530 | 1.000 | 1.000 | 0.533 | 1.000 | 0.933 | 0.600 |
b019063 | 0.733 | 0.667 | 0.600 | 0.667 | 0.533 | 0.200 |
b005063 | 0.733 | 0.933 | 0.667 | 0.333 | 0.933 | 0.400 |
a004035 | 0.733 | 0.800 | 0.867 | 0.867 | 0.333 | 0.400 |
a003713 | 0.533 | 0.533 | 0.133 | 0.467 | 0.667 | 0.667 |
b015200 | 0.333 | 0.467 | 0.133 | 0.400 | 0.200 | 0.400 |
a004755 | 0.733 | 0.800 | 0.933 | 0.733 | 0.667 | 0.533 |
b019276 | 0.533 | 0.867 | 0.533 | 0.733 | 0.267 | 0.667 |
b018901 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800 | 0.733 |
b005570 | 0.600 | 0.600 | 0.600 | 0.667 | 0.400 | 0.400 |
b006144 | 0.733 | 0.867 | 0.933 | 0.800 | 0.667 | 0.467 |
b002169 | 0.333 | 0.467 | 0.733 | 0.600 | 0.467 | 0.533 |
b016133 | 0.800 | 0.533 | 0.600 | 0.600 | 0.533 | 0.267 |
Ave. Greater0 Score: | 0.627 | 0.623 | 0.586 | 0.579 | 0.557 | 0.509 |
Greater1
queryID | EP | TP | VS | LR | KWT | KWL |
---|---|---|---|---|---|---|
a001528 | 0.067 | 0.000 | 0.067 | 0.133 | 0.067 | 0.067 |
a004667 | 0.133 | 0.067 | 0.133 | 0.133 | 0.000 | 0.067 |
a000518 | 0.000 | 0.000 | 0.000 | 0.067 | 0.000 | 0.067 |
a002693 | 0.467 | 0.067 | 0.000 | 0.200 | 0.133 | 0.267 |
a004830 | 0.000 | 0.000 | 0.000 | 0.000 | 0.067 | 0.000 |
a002784 | 0.200 | 0.200 | 0.133 | 0.200 | 0.067 | 0.067 |
a005705 | 0.467 | 0.600 | 0.267 | 0.067 | 0.000 | 0.000 |
a006272 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
a007005 | 0.000 | 0.000 | 0.000 | 0.067 | 0.000 | 0.000 |
a008401 | 0.067 | 0.133 | 0.133 | 0.000 | 0.400 | 0.000 |
a008850 | 0.000 | 0.133 | 0.000 | 0.000 | 0.000 | 0.000 |
a007054 | 0.067 | 0.133 | 0.067 | 0.067 | 0.000 | 0.133 |
a008365 | 0.000 | 0.000 | 0.000 | 0.067 | 0.133 | 0.000 |
b000990 | 0.467 | 0.200 | 0.133 | 0.133 | 0.333 | 0.067 |
b001799 | 0.067 | 0.200 | 0.000 | 0.133 | 0.133 | 0.000 |
b001516 | 0.200 | 0.533 | 0.333 | 0.200 | 0.267 | 0.200 |
b002576 | 0.000 | 0.000 | 0.000 | 0.000 | 0.200 | 0.067 |
b004483 | 0.200 | 0.067 | 0.067 | 0.133 | 0.267 | 0.000 |
b006517 | 0.400 | 0.733 | 0.000 | 0.600 | 0.533 | 0.067 |
b005395 | 0.200 | 0.267 | 0.200 | 0.400 | 0.400 | 0.133 |
b007493 | 0.800 | 0.667 | 0.733 | 0.333 | 0.400 | 0.800 |
b005447 | 0.267 | 0.267 | 0.467 | 0.067 | 0.267 | 0.267 |
b009401 | 0.467 | 0.133 | 0.467 | 0.400 | 0.333 | 0.333 |
b006979 | 0.067 | 0.067 | 0.267 | 0.200 | 0.067 | 0.133 |
b012801 | 0.067 | 0.000 | 0.067 | 0.133 | 0.000 | 0.267 |
b008611 | 0.067 | 0.067 | 0.133 | 0.000 | 0.000 | 0.000 |
b013992 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
b015082 | 0.400 | 0.400 | 0.267 | 0.267 | 0.133 | 0.467 |
b015991 | 0.933 | 0.533 | 0.800 | 0.933 | 0.200 | 0.133 |
b009364 | 0.067 | 0.000 | 0.200 | 0.200 | 0.200 | 0.000 |
a007915 | 0.533 | 0.333 | 0.667 | 0.133 | 0.067 | 0.200 |
a002856 | 0.067 | 0.000 | 0.067 | 0.000 | 0.133 | 0.067 |
a000751 | 0.600 | 0.400 | 0.067 | 0.000 | 0.000 | 0.000 |
a002907 | 0.200 | 0.333 | 0.200 | 0.000 | 0.000 | 0.067 |
a000193 | 0.200 | 0.267 | 0.133 | 0.067 | 0.000 | 0.067 |
b006599 | 0.000 | 0.000 | 0.000 | 0.133 | 0.000 | 0.000 |
b010953 | 0.600 | 0.333 | 0.667 | 0.800 | 0.267 | 0.467 |
a003397 | 0.067 | 0.200 | 0.067 | 0.133 | 0.067 | 0.000 |
a006525 | 0.400 | 0.333 | 0.067 | 0.200 | 0.333 | 0.000 |
b012279 | 0.400 | 0.133 | 0.467 | 0.133 | 0.333 | 0.333 |
a004526 | 0.200 | 0.067 | 0.200 | 0.267 | 0.267 | 0.133 |
b010504 | 0.533 | 0.333 | 0.533 | 0.533 | 0.200 | 0.133 |
b017426 | 0.200 | 0.267 | 0.067 | 0.200 | 0.133 | 0.267 |
b011185 | 0.200 | 0.133 | 0.067 | 0.133 | 0.067 | 0.067 |
b011453 | 0.200 | 0.067 | 0.000 | 0.133 | 0.067 | 0.200 |
b006618 | 0.200 | 0.400 | 0.400 | 0.267 | 0.133 | 0.333 |
b017223 | 0.000 | 0.000 | 0.067 | 0.000 | 0.267 | 0.267 |
a001530 | 0.467 | 0.267 | 0.200 | 0.333 | 0.267 | 0.067 |
b019063 | 0.267 | 0.200 | 0.133 | 0.133 | 0.200 | 0.000 |
b005063 | 0.400 | 0.467 | 0.067 | 0.000 | 0.133 | 0.067 |
a004035 | 0.267 | 0.333 | 0.400 | 0.200 | 0.000 | 0.067 |
a003713 | 0.267 | 0.133 | 0.000 | 0.067 | 0.200 | 0.067 |
b015200 | 0.000 | 0.067 | 0.000 | 0.000 | 0.000 | 0.000 |
a004755 | 0.200 | 0.267 | 0.267 | 0.000 | 0.133 | 0.133 |
b019276 | 0.067 | 0.133 | 0.133 | 0.200 | 0.000 | 0.000 |
b018901 | 0.400 | 0.333 | 0.600 | 0.400 | 0.400 | 0.000 |
b005570 | 0.067 | 0.133 | 0.133 | 0.000 | 0.000 | 0.000 |
b006144 | 0.133 | 0.200 | 0.400 | 0.400 | 0.200 | 0.200 |
b002169 | 0.000 | 0.333 | 0.200 | 0.067 | 0.000 | 0.200 |
b016133 | 0.133 | 0.000 | 0.267 | 0.067 | 0.067 | 0.067 |
Ave. Greater1 Score: | 0.223 | 0.199 | 0.191 | 0.169 | 0.142 | 0.118 |
Raw Scores
The raw data derived from the Evalutron 6000 human evaluations are located on the 2006:Audio Music Similarity and Retrieval Raw Data page.
Query Meta Data
queryID | artist | genre |
---|---|---|
a001528 | Xpression | Jazz |
a004667 | The Tony Rich Project | R&B |
a000518 | Junior C. | Reggae |
a002693 | B.J. Thomas | Country |
a004830 | Luciano & Co | Reggae |
a002784 | Elton John | Rock |
a005705 | Jessica | R&B |
a006272 | Orlando Barroso | Latin |
a007005 | Big Time Operator | Jazz |
a008401 | Prince Malachi | Reggae |
a008850 | Elida y Avante | Latin |
a007054 | Profyle | R&B |
a008365 | Barbara Sfraga | Jazz |
b000990 | Guns N' Roses | Rock |
b001799 | Enya | New Age |
b001516 | Britney Spears | Rock |
b002576 | Depeche Mode | Rock |
b004483 | Elvis Costello | Rock |
b006517 | Paul Van Dyk | Electronica & Dance |
b005395 | Ozzy Osbourne | Rock |
b007493 | Eminem | Rap & Hip Hop |
b005447 | Mudvayne | Rock |
b009401 | Ja Rule | Rap & Hip Hop |
b006979 | Cat Stevens | Rock |
b012801 | The Chemical Brothers | Electronica & Dance |
b008611 | The Cranberries | Rock |
b013992 | Enigma | New Age |
b015082 | DMX | Rap & Hip Hop |
b015991 | Tim McGraw | Country |
b009364 | Bon Jovi | Rock |
a007915 | Victor Sanz | Country |
a002856 | Atomic Babies | Electronica & Dance |
a000751 | Brian Hughes | Jazz |
a002907 | Gary Meek | Jazz |
a000193 | Mercurio | Latin |
b006599 | Selena | Latin |
b010953 | Jessica Andrews | Country |
a003397 | Roy Davis Jr. | Electronica & Dance |
a006525 | Wind Machine | New Age |
b012279 | OutKast | Rap & Hip Hop |
a004526 | Shannon | R&B |
b010504 | LL Cool J | Rap & Hip Hop |
b017426 | Shaggy | Reggae |
b011185 | Sting | Rock |
b011453 | Neil Young | Rock |
b006618 | Foo Fighters | Rock |
b017223 | Nirvana | Rock |
a001530 | Mötley Crüe | Rock |
b019063 | Smashing Pumpkins | Rock |
b005063 | Sublime | Rock |
a004035 | Toy-Box | Electronica & Dance |
a003713 | Brian Bromberg | Jazz |
b015200 | Mike Oldfield | New Age |
a004755 | Profyle | R&B |
b019276 | Robbie Williams | Rock |
b018901 | Nelly | Rap & Hip Hop |
b005570 | Everything But the Girl | Rock |
b006144 | Def Leppard | Rock |
b002169 | No Doubt | Rock |
b016133 | Janet Jackson | Rock |
Results from Automatic Evaluation
Statistic | Pohle | Pampalk | Lidy & Rauber | West (Trans) | West (Likely) |
---|---|---|---|---|---|
top20genre% | 60.84% | 60.64% | 56.96% | 53.18% | 47.76% |
top20artist% | 41.32% | 34.73% | 27.73% | 20.68% | 15.85% |
top20album% | 36.57% | 30.54% | 32.16% | 24.72% | 19.42% |
artist-filtered genre% | 58.91% | 60.70% | 56.71% | 54.06% | 49.20% |
mean artist-filtered genre% | 27.27% | 28.27% | 26.01% | 21.62% | 19.56% |
avg dist - genre | 0.6970 | 0.9924 | 0.9524 | 0.9738 | 0.9830 |
avg dist - artist | 0.4244 | 0.9772 | 0.7339 | 0.8734 | 0.6010 |
avg dist - album | 0.3721 | 0.9758 | 0.7205 | 0.8689 | 0.5702 |
triangular inequality | 32.02% | 100.00% | 100.00% | 100.00% | 55.08% |
top20always-sim | 260 | 1928 | 137 | 173 | 90 |
top20never-sim% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Other Results from Automatic Evaluation
See the 2006:Audio Music Similarity and Retrieval Other Automatic Evaluation Results page.
Introduction to automatic evaluation
Automated evaluation of music similarity techniques based on a metadata catalogue has several advantages:
- It does not require costly human ‘graders’
- It allows testing of incremental changes in indexing algorithms
- It can achieve complete coverage over the test collection
- It provides a target for machine-learning, feature-selection and optimisation experiments
- It can predict the visualisation performance of an indexing technique
- It can identify indexing ‘anomalies’ in the indices tested
Automated ‘pseudo-objective’ evaluation of music similarity estimation techniques was introduced by Logan & Salomon [1] and was shown to be highly correlated with careful human-based evaluations by Pampalk [2]. The results of this contest support the conclusions of Pampalk [2], although further work is required to fully understand the evaluation statistics.
Description of evaluation statistics
The evaluation statistics are:
- Neighbourhood clustering (artist, genre, album): the average % of the top N results for each query in the collection with the same label.
- Artist-filtered genre neighbourhood: the average % of the top N results for each query belonging to the same genre label, ignoring matches from the same artist; this ensures that results reflect musical rather than merely audio similarity (a code sketch follows this list).
- Mean artist-filtered genre neighbourhood: a normalised form of the above statistic that weights each genre equally, penalising lop-sided performance.
- Normalised average distance between examples: the average distance between examples with the same label; indicates the degree of clustering and the potential for visual organisation of a collection.
- Always similar (hubs): the largest number of times an example appears in the top N results for other queries; a result that appears too often will adversely affect performance without affecting the other statistics.
- Never similar (orphans): the % of examples that never appear in a top N result list and so cannot be retrieved by search.
- Triangular inequality (metric space): indicates whether the function produces a metric distance space and therefore which visualisation techniques may be applied to it.
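The following MATLAB sketch shows one way the artist-filtered genre neighbourhood might be computed; dist, genre and artist are assumed names, and N = 20 matches the top-20 statistics reported above.

```matlab
% Hypothetical sketch of the artist-filtered genre neighbourhood. Assumed
% inputs: dist (full distance matrix), genre and artist (cell arrays of
% labels, one entry per track).
N = 20;
nTracks = size(dist, 1);
hits = zeros(nTracks, 1);
for q = 1:nTracks
    [~, order] = sort(dist(q, :), 'ascend');          % rank by distance
    order = order(order ~= q & ~strcmp(artist(order), artist{q}));
    topN = order(1:N);                                % artist-filtered top N
    hits(q) = mean(strcmp(genre(topN), genre{q}));    % fraction same-genre
end
afGenre = 100 * mean(hits);   % average % over every query in the collection
```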
Normalisation
Each of the neighbourhood statistics described above has been normalised by the number of examples of each class (a genre, album or artist) available in the test database. For example, if the collection contained 20 tracks by a particular artist and a particular system retrieved 10 of those examples in its top 50 results, it would normally achieve an artist neighbourhood score of 20% (10 of 50 results), while the normalised form of the metric would report a score of 50% (10 of the 20 available matches were retrieved). Such normalisation is intended to avoid bias introduced into the results by the skewed distribution of the examples according to each label set.
The mean artist-filtered genre neighbourhood is a normalised form of the artist-filtered genre neighbourhood metric which gives equal weight to the performance of a system on each genre class. This version of the statistic is intended to match the prior probabilities, or distribution of examples according to genre labels, used as queries in the human listening test (where an equal number of examples from each class was selected by stratified random sampling), rather than the distribution of examples appearing in the database.
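Here is a sketch of the class-normalised variant, reusing hits and genre from the previous sketch; averaging within each genre first, then across genres, gives every genre equal weight regardless of how many examples it has in the database.

```matlab
% Hypothetical sketch of the mean (class-normalised) artist-filtered genre
% neighbourhood, reusing hits and genre from the previous sketch.
[classes, ~, idx] = unique(genre);             % map each track to its genre
perClass = accumarray(idx, hits, [], @mean);   % mean score within each genre
meanAFGenre = 100 * mean(perClass);            % unweighted mean over genres
```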
Music-similarity evaluation issues
Care must be taken with all evaluations of audio music similarity estimation techniques, as there is great potential in these experiments for over-fitting and for producing over-optimistic estimates of a system's performance on novel test data.
The metadata catalog used to conduct automated evaluations should be as accurate as possible. However, the technique seems relatively robust to a degree of noise in the catalogue, perhaps due to its coarse granularity.
Small test collections do not allow us to accurately predict performance on larger test collections. For example:
- Indexing anomalies (‘hubs’ and ‘orphans’) cannot yet be understood:
  - a single ‘hub’ was found in the results of one system, appearing in nearly 2/5 of result lists;
  - removing this one example from the collection of 5000 tracks makes it appear that the system does not suffer from indexing anomalies.
- What will be the number and coverage of ‘hubs’ in a 100,000 song DB?
Directions for further work on evaluating audio music similarity
- Establish whether the stratified sampling used in the human evaluations is optimal for producing results that reflect human perception of the quality of music indexes, or whether the database should be sampled randomly.
  - This will influence the selection of a statistic for use in automated evaluations or optimisation experiments (the artist-filtered genre or the mean artist-filtered genre).
- Explain the indexing anomalies in some techniques.
- Determine a safe minimum size for a test collection to be used to predict performance on an ‘industrial-sized’ collection.
- Establish the optimal granularity, or range of granularities, for a genre catalogue to be used in this type of evaluation (8, 32 or 256 classes?) and integrate a confusion-cost matrix to reduce the penalisation of confusion between similar genres of music (e.g. Punk and Heavy Metal) relative to confusion between highly dissimilar genres (e.g. Classical and Heavy Metal).
Evaluation Tools in Music-2-Knowledge (M2K)
The tools used to produce the evaluation statistics for MIREX 2006 will be released as part of M2K 1.2 (forthcoming). These tools provide services to:
- import collection metadata and distance matrices
- generate a stratified query set
- extract artist-filtered results (for use in human evaluation exps)
- calculate any of the evaluation statistics described above.
These tools may be used on the command line by implementing the MIREX distance matrix file format, with M2K in the Data-2-Knowledge toolkit (D2K), or integrated into existing Java code with the new M2K API.
To obtain a copy of the evaluation tools prior to the M2K 1.2 release, contact Kris West.
Comments
The evaluation statistics for the MIREX 2006 Audio Music Similarity contest seem to support the contention that genre, artist and artist-filtered genre neighbourhood statistics are correlated with human perception of the performance of music similarity estimators, as they all reproduce the ranking produced by the human evaluation. However, the differences between systems in that evaluation are not statistically significant, so no firm conclusion can be drawn. Average distance statistics produce a different ranking, but they are intended to correlate with visualisation performance rather than search. Kriswest
A statistic for evaluation and use in selection & optimization experiments
As each statistic was found to be correlated with the results of the listening test, any of them *may* be used to evaluate performance and to guide model optimisation or feature selection/weighting experiments. However, unfiltered genre and artist identification statistics are known to allow overfitting and to produce over-optimistic performance estimates. In a model optimisation or feature selection experiment these statistics are more likely to indicate Audio-similarity performance than actual Music-similarity performance, and may lead to the selection of sub-optimal features or models. The artist-filtered genre neighbourhood can be used to avoid this effect.
The results from MIREX 2006 do not show a significant drop in performance using the artist-filtered genre statistic, as would normally be expected. This may be due to the excessively skewed distribution of examples in the database (roughly 50% of examples are labelled as Rock/Pop, while a further 25% are Rap & Hip-Hop); hence, the difference between the results produced and the random baseline is not well emphasised. Normalising this statistic by the prior probabilities of examples in the database (taking the mean of the diagonal of the artist-filtered genre confusion matrix) equally weights the contribution of each class to the final statistic and prevents performance on a single class from dominating the statistic. This normalised statistic shows a drastic reduction in the performance estimates for each system and increases the relative distance between the systems in the evaluation. Kriswest
References
1. Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis. http://gatekeeper.research.compaq.com/pub/compaq/CRL/publications/logan/icme2001_logan.pdf
One of the first papers on this topic. Reports a small-scale listening test (2 users) in which items in playlists are rated as similar or not similar to the query song. In addition, automatic evaluation is reported: the percentage of the top 5, 10 and 20 most similar songs in the same genre/artist/album as the query.
2. E. Pampalk, Computational Models of Music Similarity and their Application in Music Information Retrieval. PhD thesis, Vienna University of Technology, Austria, March 2006. http://www.ofai.at/~elias.pampalk/publications/pampalk06thesis.pdf