Difference between revisions of "2007:Audio Cover Song Identification"

From MIREX Wiki
Revision as of 11:58, 14 June 2007

Overview

The Cover Song task was a new task for MIREX 2006. It is closely related to the Audio Music Similarity and Retrieval task, as the cover songs to be found will be embedded in the Audio Music Similarity and Retrieval test collection. Hence, the setup and running of the algorithms will be the same as for that task; please refer to the Audio Music Similarity and Retrieval page for more details.

Task Description

Within the Audio Music Similarity and Retrieval database, 30 different "cover songs" are embedded, each represented by 11 different "versions", for a total of 330 audio files (16-bit, monophonic, 22.05 kHz, WAV). The "cover songs" represent a variety of genres (e.g., classical, jazz, gospel, rock, folk-rock) and the variations span a variety of styles and orchestrations.

Using each of these cover song files in turn as the "seed/query" file, we will examine the top 10 returned items for the presence of the other 10 versions of the "seed/query" file.

Participants in the Audio Music Similarity and Retrieval task really need not do anything extra as the matrices returned will contain all the necessary information for us to automatically conduct the evaluations. We do encourage, however, the participation of those systems that might be especially designed to detect "cover song" variants.
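As a rough illustration of how the top-10 check could be run from a returned matrix, here is a minimal Python sketch. It assumes the matrix holds distances (smaller means more similar) and that a list of cover-song labels is available for every file; the function name `covers_in_top_n`, the label format and the toy data are hypothetical and not part of the MIREX setup.

```python
def covers_in_top_n(distances, labels, query, n=10):
    """Count how many of the n nearest items to `query` are versions of the same song.

    distances: square list of lists; distances[i][j] is the distance from
               item i to item j (assumed: smaller = more similar).
    labels:    labels[i] is the cover-song ID of item i, or None for tracks
               that are not part of any cover set.
    """
    # Rank every other item by its distance to the query, nearest first.
    candidates = [j for j in range(len(labels)) if j != query]
    candidates.sort(key=lambda j: distances[query][j])
    # Count the items in the top n that carry the query's cover-song ID.
    return sum(
        1
        for j in candidates[:n]
        if labels[j] is not None and labels[j] == labels[query]
    )


# Toy collection: three versions of song "a", one of song "b", one unrelated track.
toy_labels = ["a", "a", "b", "a", None]
toy_distances = [
    [0.0, 0.2, 0.1, 0.5, 0.3],
    [0.2, 0.0, 0.6, 0.1, 0.4],
    [0.1, 0.6, 0.0, 0.7, 0.2],
    [0.5, 0.1, 0.7, 0.0, 0.9],
    [0.3, 0.4, 0.2, 0.9, 0.0],
]
print(covers_in_top_n(toy_distances, toy_labels, query=1, n=2))
```

With the real collection, `n` would stay at 10 and the count for each query would lie between 0 and 10.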

Evaluation

We could employ the same measures used in 2006:Audio Cover Song.

Comments

Evaluation measures: Perhaps the MRR of the 1st correctly classified instance could be changed to the MRR of the whole 10 answers...

dpwe: Average Precision is a popular and well-behaved measure for scoring the ranking of multiple correct answers in a retrieval task. It is calculated from a full list of ranked results as the average of the precisions (proportion of returns that are relevant) calculated when the ranked list is cut off at each true item. So if there are 4 true items, and they occur at ranks 1, 3, 4, and 6, the average precision is 0.25*(1/1 + 2/3 + 3/4 + 4/6) = 0.77. It has the nice properties of not cutting off the return list at some particular length, and of progressively discounting the contribution of items ranked deep in the list. It's also widely used in multimedia retrieval, so people are used to it. http://en.wikipedia.org/wiki/Information_retrieval#Average_precision
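The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `average_precision` and the boolean-list input format are assumptions, and it presumes that every true item appears somewhere in the ranked list.

```python
def average_precision(ranked_relevance):
    """Average precision for a single query.

    ranked_relevance: list of booleans, best-ranked item first;
                      True marks a true (relevant) item.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision when the list is cut off at this true item.
            precisions.append(hits / rank)
    # Average over the true items found; 0.0 if there were none.
    return sum(precisions) / len(precisions) if precisions else 0.0


# The example from the comment: true items at ranks 1, 3, 4 and 6.
example = [True, False, True, True, False, True]
print(round(average_precision(example), 2))  # prints 0.77
```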

dpwe: In addition to a ranking task, I think it would be interesting to include a detection task i.e. does a cover version of this track exist in this database or not? This is essentially just setting a threshold on the metric used in ranking the returns, but different algorithms may be better or worse at setting such a threshold that is *consistent* between different tracks. This could be computed as a Receiver Operating Characteristic i.e. False Alarms vs. False Reject curve as a threshold is varied on the similarity scores returned by the unmodified algorithms.
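One possible way to trace such a curve is sketched below in Python. It assumes each query has been reduced to a single score (for example, its best similarity to any other track, with higher meaning "more likely a cover exists") and that ground truth is known; the function name `detection_curve` and the toy scores are hypothetical.

```python
def detection_curve(scores, has_cover):
    """Return (threshold, false_alarm_rate, false_reject_rate) triples.

    scores:    one score per query (assumed: higher = more likely a cover exists).
    has_cover: matching list of booleans giving the ground truth.
    """
    positives = sum(has_cover)
    negatives = len(has_cover) - positives
    curve = []
    for threshold in sorted(set(scores)):
        # A query is declared "has a cover" when its score reaches the threshold.
        false_alarms = sum(
            1 for s, y in zip(scores, has_cover) if s >= threshold and not y
        )
        false_rejects = sum(
            1 for s, y in zip(scores, has_cover) if s < threshold and y
        )
        curve.append((
            threshold,
            false_alarms / negatives if negatives else 0.0,
            false_rejects / positives if positives else 0.0,
        ))
    return curve


# Toy data: two queries with a cover in the database, two without.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_truth = [True, True, False, False]
for point in detection_curve(toy_scores, toy_truth):
    print(point)
```

Because one threshold is applied to all queries, an algorithm whose scores are not comparable between tracks will show a worse trade-off here even if its per-query rankings are good.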

--jserra 11:58, 14 June 2007 (CDT):

- I agree on using AP for ranking instead of MRR.

- Regarding detection, I think a simple recall measure over, say, the first 10 answers would be enough. Or perhaps the mean number of detected cover songs within these first 10 ranked elements (like last year) would be more straightforward and intuitive. All this, of course, taking into account that all cover groups have the same number of items.

- I will propose to use geometric means (i.e. GMAP) to average results between queries. When averaging over all queries, this penalizes very bad answers to a given query. Geometric mean is commonly used in TREC (see here).

- So, my proposal is: GMAP and Geometric mean of the number of correctly classified covers in the 10 first retrieved documents.
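To show how a geometric mean penalizes one very bad query more than an arithmetic mean does, here is a small Python sketch. The helper name `geometric_mean`, the floor value `eps` (added so that a single zero score does not force the whole result to zero) and the per-query numbers are assumptions made for illustration only.

```python
import math


def geometric_mean(values, eps=1e-5):
    """Geometric mean of per-query scores.

    eps is an assumed floor applied to each value so that one zero
    does not drive the result to zero.
    """
    if not values:
        raise ValueError("need at least one value")
    logs = [math.log(max(v, eps)) for v in values]
    return math.exp(sum(logs) / len(logs))


# Hypothetical per-query average precisions: two good queries, one very bad.
ap_per_query = [0.9, 0.8, 0.01]
arithmetic_map = sum(ap_per_query) / len(ap_per_query)
gmap = geometric_mean(ap_per_query)
print(round(arithmetic_map, 2), round(gmap, 2))

# The same averaging applied to hypothetical counts of covers found in the top 10.
covers_in_top10 = [8, 7, 1]
print(round(geometric_mean(covers_in_top10), 2))
```

In this toy case the arithmetic mean stays above 0.5 while the geometric mean falls below 0.2, which is the penalizing behaviour described above.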

Moderators

Potential Participants

  • Joan Serrà (jserra at iua dot upf dot edu) and Emilia Gómez (egomez at iua dot upf dot edu)
  • Dan Ellis (dpwe at ee dot columbia dot edu)
  • Juan P. Bello (jpbello at nyu dot edu)
  • Kyogu Lee (kglee at ccrma dot stanford dot edu)