= 2006:Audio Music Similarity and Retrieval =
== Results ==  
Results are on the [[2006:Audio Music Similarity and Retrieval Results]] page.
  
== Introduction ==
As the size of digital music collections grows, music similarity has an increasingly important role as an aid to music discovery.  A music similarity system can help a music consumer find new music by finding the music that is most musically similar to specific query songs (or is nearest to songs that the consumer already likes).  However, evaluating music similarity systems is inherently difficult, since music similarity is a subjective characteristic.  Some attempts have been made to evaluate similarity systems with some degree of success (see [[#Related Papers]]).  In general, due to the lack of ground truth for music similarity, other evaluations such as genre and artist identification have been used as surrogates for music similarity.

This year, for the first time, we are attempting a large-scale music similarity evaluation as part of MIREX.  This evaluation will rely primarily on human judgement to rank the various submissions.  We will also collect various objective measures related to genre and artist clustering, in the hope of finding some correlation between these objective measures and the human evaluations.

This page presents the Audio Music Similarity evaluation, including the submission rules and formats. Additionally, background information can be found here that should help explain some of the reasoning behind the approach taken in the evaluation.

'''Evaluation Summary:'''

* We will be soliciting contributions to two distinct tracks: Audio Music Search & [[2006:Audio Cover Song]]
* The division between these two tracks will be emphasized in the evaluation results, although results will be directly comparable (all evaluations will be performed for both tracks).
* The intention of the Audio Music Search track is to evaluate music similarity search (a music search engine that takes a single song as a query), not playlist generation or music recommendation.
* Any criteria may be used to implement the search, although socio-cultural context and lyrics are not under consideration.
* Any models, codebooks etc. '''must''' be trained in advance. No collection specific optimisations should be used.
* Please avoid use of the [http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html USPOP] and USCRAP collections in your training as they will form the test database. Please also avoid any other overlap with the test data that you can identify.
== Evalutron 6000 Issues ==
Please go to the [[2006:Evalutron6000 Issues]] page for detailed information about how the human evaluations component will be undertaken.
== Important Dates ==
* '''Before August 14th''': test and then submit algorithms, start running computations on the DB.
* '''Before August 21st''': finish running all submissions.
* '''After August 21st''': start human evaluations.

(See also the MIREX 2006 [[2006:Main Page]].)

All runs must be complete by August 21st in order to take part in the human evaluation. This means that the final submission deadline falls on '''August 14th at the absolute latest'''! No submission will be considered after this date (there will be absolutely no extensions to this deadline), and your submission must actually work by the 14th (no additional debugging time). Therefore, it is highly recommended that you submit before this date and ensure that you have run the DistanceMatrix test on the Wiki. Remember that this is not a flexible deadline, and you should submit long before it falls. (Announced by Kris on July 28)
== Runtime (Computational Resources) ==

We have not really discussed runtimes this year. This year's competition database will contain ~5000 files, and some systems may need quite a long time to process it. If your system is likely to take more than 36 hours to process this collection, this should be clearly stated in your README file, and you should make its likely runtime known on the list now. Given these long runtimes, it would also be preferable to make your run resumable, i.e. save your extracted feature files to the supplied cache dir location (or a sub-directory of it). Please include details of any directory structures to be created in the cache dir in your README file, and ensure you acknowledge any training/tuning data that you have used in the development of your submission. (Announced by Kris on July 28)
== Outstanding Issues ==
There are still a few issues that need to be resolved before this task is finalized:
* Binary relative judgements vs. Absolute - we've had much debate over this.  Most researchers seem to prefer the binary relative approach, but that places an '''extreme''' burden on the evaluators.
* We are awaiting the final description of the Evalutron.
== Packaging your Submission ==
* Be sure that your submission follows the [[#Submission_Format]] outlined below.
* Be sure that your submission accepts the proper [[#Input_File]] format
* Be sure that your submission produces the proper [[#Output_File]] format, which can be tested with the DistanceMatrix test harness [[#Test_harness]]
* Be sure to follow the [https://www.music-ir.org/mirex2006/index.php/Best_Coding_Practices_for_MIREX Best Coding Practices for MIREX]
* Be sure to follow the [[2006:MIREX 2006 Submission Instructions]]
* In the README file that is included with your submission, please answer the following additional questions:
** Was the submission trained or tuned?
** What was the source of the data used?
** Are you aware of any overlap between the data and either the USCRAP, USPOP or Cover song collections?
** Approximately how long will the submission take to process ~5000 wav files?
** Approximately how much scratch disk space will the submission need to store any feature/cache files?
* Submit your system via the URL located at the bottom of the [[2006:MIREX 2006 Submission Instructions]] page

Note that the information that you place in the README file is '''extremely''' important in ensuring that your submission is evaluated properly.
== The Evaluation Database ==
The specifications of the evaluation database will be as follows:
* 22.05kHz, mono, 16bit, WAV files
* The WAV files will be decoded from 192kbit variable-bit-rate, stereo, 44.1kHz MP3 files produced with the LAME codec
* Will contain ~5000 tracks
* Selected from both the USPOP and USCRAP collections
* No tracks shorter than 30 seconds
* No tracks longer than 10 minutes
* A maximum of 20 tracks per artist
* A minimum of 50 tracks per labelled genre
* Will contain the ~350 songs from the IMIRSEL cover song collection (30 distinct pieces, ~10-12 versions of each)
* Cover songs, USPOP and USCRAP files will all be handled in exactly the same way (archival quality copy > 192k VBR MP3 > 22kHz WAV)
== Evaluation Methodology ==
Two distinct evaluations will be performed:
* Human Evaluation
* Objective statistics derived from the distance matrix

Note that the [[2006:Audio Cover Song]] task will be using the same framework to run the computations. However, the evaluation methodology is very different.
=== Human Evaluation ===

The primary evaluation will involve subjective judgments of the retrieved sets by human evaluators, using Stephen Downie's Evalutron 6000 (the final description of the Evalutron is pending).

* Evaluator question: Given a search based on track A, which of these two tracks (B or C) is a better result? ''(Note: there is still some question as to whether using binary relative comparisons is a viable approach when the number of comparisons required is considered)''
* ~40 randomly selected queries,  5 results per query, 3 sets of eyes, ~10 participating labs
* A higher number of queries is preferred, as IR research indicates that the variance is in the queries
* The cover songs and songs by the same artist as the query will be filtered out of each result list to avoid colouring an evaluator's judgement (a cover song or a song by the same artist in a result list is likely to reduce the relative ranking of other similar but independent songs; use of songs by the same artist may also allow over-fitting to affect the results)
* These numbers may change as we are extracting the full distance matrices
* It might be possible for researchers to use this data for other types of system comparisons after MIREX 2006 results have been finalized.
* Human evaluation to be designed and led by IMIRSEL
* Human evaluators will be drawn from the participating labs (and any volunteers from IMIRSEL or on the MIREX lists)
  
=== Objective Statistics derived from the distance matrix ===
Statistics of each distance matrix will be calculated, including:
* Average % of Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Precision at 5, 10, 20 & 50
* Average % of Genre matches in the top 5, 10, 20 & 50 results after artist filtering of results
* Average % of available Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Recall at 5, 10, 20 & 50 (normalising scores when fewer than 20 matches for an artist, album or genre are available in the database)
* Normalised average distance between examples of the same Genre, Artist or Album
* Always similar - Maximum # times a file was in the top 5, 10, 20 & 50 results
* % File never similar (never in a top 5, 10, 20 & 50 result list)
* % of  song triplets where triangular inequality holds
* Plot of the "number of times similar" curve - songs are sorted by the number of times they appear in a top-20 list, and that count is plotted against song number. Systems with a sharp rise at the end of this plot have "hubs", while a long zero tail indicates many songs that are never similar.
* Ratio of the average artist distance to the average genre distance
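As a rough illustration of how statistics such as precision at N (with or without the artist filter) might be computed from a full distance matrix, here is an unofficial Python sketch. The function and its signature are hypothetical, not part of the evaluation code:

```python
def precision_at_n(dist, genres, artists, n=5, artist_filter=False):
    """Mean fraction of the top-n results sharing the query's genre.

    dist is a full NxN distance matrix (list of lists); genres and
    artists are per-track labels.  With artist_filter=True, tracks by
    the query's artist are removed from each ranked result list
    before the top n are taken, as in the artist-filtered statistic."""
    scores = []
    for q in range(len(dist)):
        # rank all tracks by distance to the query, excluding the query itself
        ranked = sorted(range(len(dist)), key=lambda i: dist[q][i])
        ranked = [i for i in ranked if i != q]
        if artist_filter:
            ranked = [i for i in ranked if artists[i] != artists[q]]
        top = ranked[:n]
        if top:
            scores.append(sum(genres[i] == genres[q] for i in top) / float(len(top)))
    return sum(scores) / float(len(scores))
```

The same loop structure extends naturally to artist/album matches, recall, or the "always similar" counts listed above.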
  
=== Additional Data Reported ===
* Runtimes - where possible, accurate runtimes will be recorded. The two call format allows separation of feature extraction/indexing runtimes from the final query runtimes.
== Submission Format ==
A submission to the Audio Music Similarity and Retrieval evaluation is expected to follow the [[2006:Best Coding Practices for MIREX]] and must conform to the following for execution:
=== One Call Format ===
The one call format is appropriate for systems that perform all phases of the evaluation (typically feature extraction and evaluation) in one step.  A submission should be an executable program that takes 3 arguments:
* path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
* path/to/cacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
* path/to/output/DistMatrix - the file where the output distance matrix should be placed. The format is described below
  
'''Example:'''
<pre>
  
doAudioSim "path/to/fileContainingListOfFilesInDB" "path/to/cacheDir" "path/to/output/DistMatrix"
  
</pre>
  
=== Two Call Format ===
The two call format is appropriate for systems that break their processing into two phases (typically a feature extraction phase and an evaluation phase). The submission should consist of two programs:
  
* doFeatureExtraction - this takes two arguments:
** path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
** path/to/cacheDir - a directory where the submission can place the output of the first stage
* outputDistMatrix - this takes two arguments:
** path/to/cacheDir - a directory where the first stage has placed its output.
** path/to/output/DistMatrix - the file where the output distance matrix should be placed.  The format is described below.
'''Example''':
<pre>
  
doFeatureExtraction "path/to/fileContainingListOfAudioFiles" "path/to/cacheDir"
outputDistMatrix "path/to/cacheDir" "path/to/output/DistMatrix"
  
</pre>
=== Matlab format ===
  
Matlab will also be supported in the form of functions in the following formats:
==== Matlab One call format ====
<pre>
doMyMatlabAudioSim('path/to/fileContainingListOfAudioFiles','path/to/cacheDir','path/to/output/DistMatrix')
</pre>
  
==== Matlab Two call format ====
<pre>
doMyMatlabFeatureExtraction('path/to/fileContainingListOfAudioFiles','path/to/cacheDir')
doMyMatlabOutputDistMatrix('path/to/cacheDir','path/to/output/DistMatrix')
</pre>
  
== File Formats ==
  
=== Input File ===
  
The input list file format will be of the form:
<pre>
path/to/audio/file/000001.wav
path/to/audio/file/000002.wav
path/to/audio/file/000003.wav
...
path/to/audio/file/00000N.wav
</pre>
  
=== Output File ===
The only output will be a '''distance''' matrix file in the following format:
<pre>
Example distance matrix 0.1 (replace this line with your system name)
1    path/to/audio/file/1.wav
2    path/to/audio/file/2.wav
3    path/to/audio/file/3.wav
...
N    path/to/audio/file/N.wav
Q/R    1        2        3        ...        N
1    0.0      1.241    0.2e-4    ...    0.4255934
2    1.241    0.000    0.6264    ...    0.2356447
3    50.2e-4  0.6264  0.0000    ...    0.3800000
...    ...    ...      ...        ...    0.7172300
N    0.42559  0.23567  0.38      ...    0.000
</pre>
  
All distances should be zero or positive (0.0+) and should not be infinite or NaN. Values should be separated by 1 or more characters of whitespace.
=== Test harness ===
The following Java application loads a distance matrix in the file format above and checks it for any errors. An M2K DistanceMatrix object is initialized from this data and written back to a file whose name is created by appending '.copy' to the original filename. This copy can be used to visually check that all the distances were correctly interpreted. Source for the DistanceMatrix reader and representation is also enclosed.
  
[[Media:TestDistanceMatrices.tar.gz]]

[[Media:TestDistanceMatrices.zip]]
  
Please report any problems to Kris West (kw at cmp dot uea dot ac dot uk).
  
== Example Submission ==
Elias Pampalk has kindly contributed a simple example Matlab Audio Similarity (distance) calculator. This example calculates similarity based on Zero-crossing rates and outputs a DistanceMatrix in the specified format.

[[Media:PampalkZCRTestMatlab.zip]]
  
Please report any problems to Elias (elias dot pampalk at gmail dot com)
  
A pairwise comparison approach ("Is playlist A better or worse than playlist B?") might be more appropriate than rating on a scale (where, for example, individuals might prefer or avoid the middle of the scale by nature, or where all ten of someone's playlists could be in reality very bad compared to the set of all generated playlists, but he or she would have no basis to judge them as such). A one-bit measurement (relevant or irrelevant) has really paid off in text retrieval.  I expect it would pay off in music retrieval as well.
 
  
A one bit measure of overall similarity may be tricky to come by. But there are plenty of one bit musical features that could be measured quickly and pretty consistently, and tracks with a sufficient proportion of these in common can reasonably be called similar. This is presumably how the folks at pandora.com manage to  process a new track in 20 seconds.  Could this approach work?
Latest revision as of 13:13, 13 May 2010

Results

Results are on 2006:Audio Music Similarity and Retrieval Results page.

Introduction

As the size of digital music collections grows, music similarity has an increasingly important role as an aid to music discovery. A music similarity system can help a music consumer find new music by finding the music that is most musically similar to specific query songs (or is nearest to songs that the consumer already likes). However, evaluating music similarity systems is inherently difficult since music similarity is a subjective characteristic. Some attempts have been made to evaluate similarity systems with some degree of success (see #Related Papers). In general, due to the lack of ground truth for music similarity, other evaluations such as genre and artist identification have been used as surrogates for music similarity.

This year, for the first time, we are attempting a large scale music similarity evaluation as part of MIREX. This evaluation will rely primarily on human judgement to rank the various submissions. We will also collect various objective measures related to genre and artist clustering, in the hope that we will find some correlation between these objective measures and the human evaluations.

This page presents the Audio Music Similarity Evaluation, including the submission rules and formats. Additionally, background information can be found here that should help explain some of the reasoning behind the approach taken in the evaluation.

Evaluation Summary:

  • We will be soliciting contributions to two distinct tracks: Audio Music Search & 2006:Audio Cover Song
  • The division between these two tracks will be emphasized in the evaluation results, although results will be directly comparable (all evaluations will be performed for both tracks).
  • The intention of the Audio Music Search track is to evaluate music similarity searches (a music search engine that takes a single song as a query), not playlist generation or music recommendation.
  • Any criteria may be used to implement the search, although we are not really considering socio-cultural context or lyrics.
  • Any models, codebooks etc. must be trained in advance. No collection-specific optimisations should be used.
  • Please avoid use of the USPOP and USCRAP collections in your training as they will form the test database. Please also avoid any other overlap with the test data that you can identify.

Evalutron 6000 Issues

Please go to the 2006:Evalutron6000 Issues page for detailed information about how the human evaluations component will be undertaken.

Important Dates

  • Before August 14th: test and then submit algorithms, start running computations on the DB.
  • Before August 21st: finish running all submissions.
  • After August 21st: start human evaluations.

(See also the MIREX 2006 2006:Main Page.)

All runs must be complete by August 21st in order to take part in the human evaluation. This means that the final submission deadline falls on August 14th at the absolute latest! No submission will be considered after this date (there will be absolutely no extensions to this deadline) and your submission must actually work by the 14th (no additional debugging time). Therefore, it is highly recommended that you submit before this date and ensure that you have run the DistanceMatrix test on the Wiki. Remember this is not a flexible deadline and you should submit long before it falls. (Announced by Kris on July 28)

Runtime (Computational Resources)

We have not really discussed runtimes this year. This year's competition database will contain ~5000 files and I suspect some systems may need quite a long while to process this. If your system is likely to take more than 36 hours to process this collection this should be clearly stated in your readme file and you should make its likely runtime known on the list now. Given these long runtimes it would also be preferable to make your run resumable, i.e. save your extracted feature files to the supplied cache dir location (or a sub-directory of it). Please include details of any directory structures to be created in the cache dir in your readme file and ensure you acknowledge any training/tuning data that you have used in the development of your submission. (Announced by Kris on July 28)

Outstanding Issues

There are still a few issues that need to be resolved before this task is finalized:

  • Binary relative judgements vs. Absolute - we've had much debate over this. Most researchers seem to prefer the binary relative approach, but that places an extreme burden on the evaluators.
  • We are awaiting the final description of the Evalutron
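Some back-of-the-envelope arithmetic makes the burden of binary relative judgements concrete. The query and result counts come from the human evaluation plan below (~40 queries, 5 results per query); the field of 10 submissions is a purely hypothetical figure for illustration:

```python
# Back-of-the-envelope judgement counts for the human evaluation, using the
# ~40 queries x 5 results from the plan and a purely hypothetical
# field of 10 submissions.
from math import comb

n_queries = 40
results_per_query = 5
n_submissions = 10  # assumption, not a MIREX figure

# Absolute judgements: each returned track is rated once.
absolute = n_queries * n_submissions * results_per_query

# Binary relative judgements between whole result lists: every pair of
# submissions compared once per query.
list_pairs = n_queries * comb(n_submissions, 2)

# Binary relative judgements between individual returned tracks: every pair
# of the 50 tracks returned for a query is compared.
track_pairs = n_queries * comb(n_submissions * results_per_query, 2)

print(absolute, list_pairs, track_pairs)  # 2000 1800 49000
```

Even comparing whole lists is comparable in cost to absolute ratings, but track-level pairwise comparison is more than an order of magnitude more work.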

Packaging your Submission

  • Be sure that your submission follows the #Submission_Format outlined below.
  • Be sure that your submission accepts the proper #Input_File format
  • Be sure that your submission produces the proper #Output_File format, which can be tested with the DistanceMatrix test harness #Test_harness
  • Be sure to follow the 2006:Best Coding Practices for MIREX
  • Be sure to follow the 2006:MIREX 2006 Submission Instructions
  • In the README file that is included with your submission, please answer the following additional questions:
    • Was the submission trained or tuned?
    • What was the source of the data used?
    • Are you aware of any overlap between the data and the USCRAP, USPOP or Cover Song collections?
    • Approximately how long will the submission take to process ~5000 wav files?
    • Approximately how much scratch disk space will the submission need to store any feature/cache files?
  • Submit your system via the URL located at the bottom of 2006:MIREX 2006 Submission Instructions page


Note that the information that you place in the README file is extremely important in ensuring that your submission is evaluated properly.

The Evaluation Database

The specifications of the evaluation database will be as follows:

  • 22.05kHz, mono, 16bit, WAV files
  • The WAV files will be decoded from 192kbit Variable-bit-rate, stereo, 44.1kHz, MP3 files, produced with the lame codec
  • Will contain ~5000 tracks
  • Selected from both the USPOP and USCRAP collections
  • No tracks shorter than 30 seconds
  • No tracks longer than 10 minutes
  • A maximum of 20 tracks per artist
  • A minimum of 50 tracks per labelled genre
  • Will contain the ~350 songs from the IMIRSEL cover songs collections (30 distinct pieces - ~10-12 versions of each)
  • Cover songs, USPOP and USCRAP files will all be handled in exactly the same way (archival quality copy > 192k VBR MP3 > 22kHz WAV)

Evaluation Methodology

Two distinct evaluations will be performed

  • Human Evaluation
  • Objective statistics derived from the distance matrix

Note that the 2006:Audio Cover Song task will be using the same framework to run the computations. However, the evaluation methodology is very different.

Human Evaluation

The primary evaluation will involve subjective judgments by human evaluators of the retrieved sets using Stephen Downie's Evalutron 6000 (Final description of the Evalutron is pending).

  • Evaluator question: Given a search based on track A, which of these two tracks (B or C) is a better result? (Note, there is still some question as to whether using binary relative comparisons is a viable approach when the number of comparisons required is considered)
  • ~40 randomly selected queries, 5 results per query, 3 sets of eyes, ~10 participating labs
  • A higher number of queries is preferred, as IR research indicates that most of the variance is across queries
  • The cover songs and songs by the same artist as the query will be filtered out of each result list to avoid colouring an evaluator's judgement (a cover song or song by the same artist in a result list is likely to reduce the relative ranking of other similar but independent songs; use of songs by the same artist may also allow over-fitting to affect the results)
  • These numbers can change as we are extracting the full distance matrices
  • It might be possible for researchers to use this data for other types of system comparisons after MIREX 2006 results have been finalized.
  • Human evaluation to be designed and led by IMIRSEL
  • Human evaluators will be drawn from the participating labs (and any volunteers from IMIRSEL or on the MIREX lists)

Objective Statistics derived from the distance matrix

Statistics of each distance matrix will be calculated including:

  • Average % of Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Precision at 5, 10, 20 & 50
  • Average % of Genre matches in the top 5, 10, 20 & 50 results after artist filtering of results
  • Average % of available Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Recall at 5, 10, 20 & 50 (scores are simply normalised when fewer than 20 matches for an artist, album or genre are available in the database)
  • Normalised average distance between examples of the same Genre, Artist or Album
  • Always similar - Maximum # times a file was in the top 5, 10, 20 & 50 results
  • % of files never similar (never in a top 5, 10, 20 & 50 result list)
  • % of song triplets for which the triangle inequality holds
  • Plot of the "number of times similar" curve - song number vs. the number of times the song appeared in a top 20 list, with songs sorted by that count. Systems with a sharp rise at the end of this plot have "hubs", while a long 'zero' tail indicates many never-similar songs.
  • Ratio of the average artist distance to the average genre distance
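Two of the statistics above can be sketched directly from a distance matrix; in this rough Python sketch the four-song matrix and its genre labels are fabricated purely for illustration:

```python
# Illustrative sketch of two distance-matrix statistics: genre precision at k
# and the "times similar" hub count. The toy matrix and genres are fabricated.

def top_k(dist, query, k):
    """Indices of the k nearest songs to `query`, excluding the query itself."""
    order = sorted((d, j) for j, d in enumerate(dist[query]) if j != query)
    return [j for _, j in order[:k]]

def genre_precision_at_k(dist, genres, k):
    """Average fraction of each query's top-k results that share its genre."""
    n = len(dist)
    per_query = []
    for q in range(n):
        hits = sum(1 for j in top_k(dist, q, k) if genres[j] == genres[q])
        per_query.append(hits / k)
    return sum(per_query) / n

def times_similar(dist, k):
    """How often each song appears in another song's top-k list ('hubness')."""
    counts = [0] * len(dist)
    for q in range(len(dist)):
        for j in top_k(dist, q, k):
            counts[j] += 1
    return counts

# Toy 4-song distance matrix (symmetric, fabricated values).
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
genres = ["rock", "rock", "jazz", "jazz"]

print(genre_precision_at_k(dist, genres, k=1))  # 1.0 - every nearest neighbour shares its query's genre
print(times_similar(dist, k=1))                 # [1, 1, 1, 1] - no hubs in this toy matrix
```

A "hubby" system would show a few large entries in the `times_similar` counts and many zeros, matching the curve shape described above.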

Additional Data Reported

  • Runtimes - Where possible accurate runtimes will be recorded. The two call format allows separation of feature extraction/indexing runtimes from the final query runtimes.

Submission Format

A submission to the Audio Music Similarity and Retrieval evaluation is expected to follow the 2006:Best Coding Practices for MIREX and must conform to the following for execution:

One Call Format

The one call format is appropriate for systems that perform all phases of the evaluation (typically feature extraction and evaluation) in one step. A submission should be an executable program that takes 3 arguments:

  • path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
  • path/to/cacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
  • path/to/output/DistMatrix - the file where the output distance matrix should be placed. The format is described below

Example:


doAudioSim "path/to/fileContainingListOfAudioFiles" "path/to/cacheDir" "path/to/output/DistMatrix"

Two Call Format

The two call format is appropriate for systems that break their processing into two phases (typically a feature extraction phase and an evaluation phase). The submission should consist of two programs:

  • doFeatureExtraction - this takes two arguments:
    • path/to/fileContainingListOfAudioFiles - the path to the list of audio files (see the format below)
    • path/to/cacheDir - a directory where the submission can place the output of the first stage
  • outputDistMatrix - this takes two arguments
    • path/to/cacheDir - a directory where the first stage has placed its output.
    • path/to/output/DistMatrix - the file where the output distance matrix should be placed. The format is described below.

Example:


doFeatureExtraction "path/to/fileContainingListOfAudioFiles" "path/to/cacheDir"
outputDistMatrix "path/to/cacheDir" "path/to/output/DistMatrix"
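For illustration, a minimal two call submission could be sketched in Python (the spec allows any language). The toy zero-crossing-rate "feature", the features.txt cache layout, and the system name string are all invented for this sketch; only the calling convention and the output format come from the spec:

```python
# Hypothetical two call submission sketch. The zero-crossing-rate "feature"
# stands in for a real feature extractor; the cacheDir layout is up to the
# submitter. In a real submission the two calls would be separate executables.
import os
import sys
import wave

def extract_feature(wav_path):
    """Toy feature: zero-crossing rate of a 16-bit mono WAV signal."""
    with wave.open(wav_path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = [int.from_bytes(frames[i:i + 2], "little", signed=True)
               for i in range(0, len(frames), 2)]
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings / max(len(samples) - 1, 1)

def do_feature_extraction(list_file, cache_dir):
    """First call: extract and cache one feature value per input file."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(list_file) as f:
        paths = [line.strip() for line in f if line.strip()]
    with open(os.path.join(cache_dir, "features.txt"), "w") as out:
        for p in paths:
            out.write("%s\t%f\n" % (p, extract_feature(p)))

def output_dist_matrix(cache_dir, matrix_file):
    """Second call: write the distance matrix in the required file format."""
    with open(os.path.join(cache_dir, "features.txt")) as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    paths = [r[0] for r in rows]
    feats = [float(r[1]) for r in rows]
    with open(matrix_file, "w") as out:
        out.write("Example ZCR submission 0.1\n")
        for i, p in enumerate(paths, start=1):
            out.write("%d\t%s\n" % (i, p))
        out.write("Q/R\t" + "\t".join(str(i) for i in range(1, len(paths) + 1)) + "\n")
        for i, fi in enumerate(feats, start=1):
            out.write("%d\t" % i + "\t".join("%.6f" % abs(fi - fj) for fj in feats) + "\n")

if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "doFeatureExtraction":
        do_feature_extraction(sys.argv[2], sys.argv[3])
    elif sys.argv[1] == "outputDistMatrix":
        output_dist_matrix(sys.argv[2], sys.argv[3])
```

Because the features are cached in cacheDir, a restarted run could skip files whose features are already present, as recommended in the runtime notes above.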

Matlab format

Matlab will also be supported in the form of functions in the following formats:

Matlab One call format

doMyMatlabAudioSim('path/to/fileContainingListOfAudioFiles','path/to/cacheDir','path/to/output/DistMatrix')

Matlab Two call format

doMyMatlabFeatureExtraction('path/to/fileContainingListOfAudioFiles','path/to/cacheDir')
doMyMatlabOutputDistMatrix('path/to/cacheDir','path/to/output/DistMatrix')

File Formats

Input File

The input list file format will be of the form:

path/to/audio/file/000001.wav
path/to/audio/file/000002.wav
path/to/audio/file/000003.wav
...
path/to/audio/file/00000N.wav

Output File

The only output will be a distance matrix file in the following format:


Example distance matrix 0.1 (replace this line with your system name)
1    path/to/audio/file/1.wav
2    path/to/audio/file/2.wav
3    path/to/audio/file/3.wav
...
N    path/to/audio/file/N.wav
Q/R    1        2        3        ...        N
1    0.0      1.241    0.2e-4     ...    0.4255934
2    1.241    0.000    0.6264     ...    0.2356447
3    50.2e-4  0.6264   0.0000     ...    0.3800000
...    ...    ...      ...        ...    0.7172300
N    0.42559  0.23567  0.38       ...    0.000

All distances should be zero or positive (0.0+) and should not be infinite or NaN. Values should be separated by 1 or more characters of whitespace.
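These constraints can be checked before submitting with a short script. The checker below is a rough Python sketch (assuming a complete matrix rather than one elided with "..."); it is not the official Java test harness described in the next section:

```python
# Illustrative pre-submission sanity check for a distance matrix file in the
# format above; not the official test harness.
import math
import sys

def check_distance_matrix(path):
    with open(path) as f:
        lines = [ln.rstrip("\n") for ln in f if ln.strip()]
    # Line 1 names the system; then one "index path" line per file until the
    # "Q/R" header row that starts the matrix itself.
    header = next(i for i, ln in enumerate(lines) if ln.split()[0] == "Q/R")
    n_files = header - 1
    rows = lines[header + 1:]
    assert len(rows) == n_files, "matrix must have one row per file"
    for row in rows:
        values = [float(v) for v in row.split()[1:]]
        assert len(values) == n_files, "each row needs N distances"
        for v in values:
            assert math.isfinite(v), "distances must not be infinite or NaN"
            assert v >= 0.0, "distances must be zero or positive"

if __name__ == "__main__" and len(sys.argv) > 1:
    check_distance_matrix(sys.argv[1])
    print("distance matrix looks OK")
```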

Test harness

The following java application loads a distance matrix in the file format above and checks it for any errors. An M2K DistanceMatrix Object is initialized from this data and written back to a file, whose name is created by appending '.copy' to the original filename. This copy can be used to visually check that all the distances were correctly interpreted. Source for the DistanceMatrix reader and representation is also enclosed.

Media:TestDistanceMatrices.tar.gz

Media:TestDistanceMatrices.zip

Please report any problems to Kris West (kw at cmp dot uea dot ac dot uk).

Example Submission

Elias Pampalk has kindly contributed a simple example Matlab Audio Similarity (distance) calculator. This example calculates similarity based on Zero-crossing rates and outputs a DistanceMatrix in the specified format.

Media:PampalkZCRTestMatlab.zip

Please report any problems to Elias (elias dot pampalk at gmail dot com)


Evaluation Background

Numerous discussions have taken place on the AudioSim mailing list. Some of the threads are summarized here: 2006:Important threads for Audio Similarity

Related Papers

  1. Logan and Salomon (ICME 2001), A Music Similarity Function Based On Signal Analysis.
    One of the first papers on this topic. Reports a small scale listening test (2 users) which rated items in a playlist as similar or not similar to the query song. In addition, automatic evaluation is reported: the percentage of the top 5, 10, 20 most similar songs in the same genre/artist/album as the query. A must read! You might also want to check out US patent 7,031,980, which was granted April 2006 (filed October 2001).
  2. Aucouturier and Pachet (ISMIR 2002), Music Similarity Measures: What's the use?
    Similar in some ways to the work of Logan and Salomon. Evaluation includes percentage of retrieved songs in the same genre (for top 1, 5, 10, 20, 100 songs) and some cluster (genre) overlap measures. Excellent paper!
  3. Ellis, Whitman, Berenzweig, and Lawrence (ISMIR 2002), The Quest for Ground Truth in Music Similarity.
    The MusicSeer survey is reported. (MusicSeer was a very clever way to get lots of users to rate artists by similarity.)
  4. Berenzweig, Logan, Ellis, and Whitman (ISMIR 2003), A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures.
    Artist similarity measures are evaluated based on data from All Music Guide, from a survey (musicseer.com), and from playlists and personal collections.
  5. Logan, Ellis, and Berenzweig (SIGIR 2003), Toward Evaluation Techniques for Music Similarity.
    Evaluating artist similarity (similar to the ISMIR 2003 version).
  6. Pampalk, Dixon, and Widmer (DAFx 2003), On the Evaluation of Perceptual Similarity Measures for Music
    An attempt was made to compare similarity measures published by different authors. Artist, album, tones, style, and genre (the last three from AMG) were used for the evaluations. The average distance between all songs vs the average distance within a group (e.g. genre) was used as quality criteria.
  7. Aucouturier and Pachet (JNRSAS 2004), Timbre Similarity: How high is the sky?.
    Follow up to their ISMIR 2002 paper. Contains detailed results of experiments on the optimization of spectral similarity. Reports a glass ceiling. Excellent article!
  8. Pampalk, Flexer, and Widmer (ISMIR 2005), Improvements of Audio-based Music Similarity and Genre Classification.
    The need for an artist filter (ie, not having the same artists in the test and training set) is described in this paper.
  9. Vignoli and Pauws (ISMIR 2005), A Music Retrieval System Based on User-Driven Similarity and its Evaluation.
    User evaluation based on a playlist generation system (which partly uses audio-based similarity).

Opt-in survey of Audio music similarity researchers

In this section we would like to take a brief 'opt-in' survey of researchers actively working in this field. Please feel free to add yourself to the list (or email your details to the moderators listed above).

  • Kris West (University of East Anglia, UK) - homepage, publications
  • Elias Pampalk (AIST, Japan) - homepage
  • Paul Lamere (Sun Labs, Sun Microsystems) - project overview
  • Rebecca Fiebrink (McGill University, Montreal) - homepage

Moderators

  • Kris West (University of East Anglia, UK) - kw@cmp.uea.ac.uk
  • Elias Pampalk (AIST, Japan) - elias.pampalk@gmail.com
  • Paul Lamere (Sun Microsystems Laboratories, USA) - paul.lamere@sun.com