Difference between revisions of "2005:Symbolic Melodic"

From MIREX Wiki
(Review 1)
Line 101: Line 101:
  
 
I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.
 
I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.
 +
 +
==Downie's Comments==
 +
 +
1. A classic retrieval task! I am game. Some hurdles to jump but I always thought RISM was a good place to start.
 +
 +
2. Do think Jeremy Pickens would have some good ideas to help this along.
 +
 +
3. I wonder if the Meldex folks would like to play?

Revision as of 00:14, 2 February 2005

Proposer

Rainer Typke (Universiteit Utrecht) rtypke@cs.uu.nl


Title

Retrieving melodically similar incipits from 500000 notated real world compositions


Description

Retrieving the most similar incipits from the RISM A/II collection, given one of the incipits as a query.

Expected results: For 11 queries, a ground truth has been established (see Typke et al. 2005, http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/TR_groundtruth.pdf ). This ground truth makes it possible to measure the search results in an absolute way, taking into consideration not only relevance, but also the correct ranking. This ground truth has been obtained by combining ranked lists that were created by 35 music experts. The resulting ground truth has the form of ranked groups of incipits; the groups contain incipits whose differences in rankings were not statistically significant, but the ranking of the groups is statistically significant. MIR systems that perform well should return incipits such that the members of the groups are retrieved without violating the order of the groups. By using this ground truth, an absolute measure of quality is attainable. However, there is a danger of overfitting algorithms to just the 11 queries for which a ground truth is known. Therefore, there should also be an evaluation based on another set of queries that are chosen only after the algorithms have been submitted. This second part of the evaluation would only allow a comparison, no absolute measurement, but any dependance of the algorithms on the queries could be avoided.


Potential Participants

Tuomas Eerola/Petri Toiviainen Jyväskylä, Finland Jürgen Kilian/Holger Hoos Darmstadt/British Columbia, Canada Shyamala Doraisamy/Stefan Rueger Malaysia/London Maarten Grachten/Josep Lluis Arcos/Ramón López de Mántaras Bellaterra, Spain Giovanna Neve/Nicola Orio Pado Padova, Italia Anna Pienimäki/Kjell Lemström Helsinki, Finland Craig Sapp/Yi Wen Liu/Eleanor Selfridge-Field Stanford, US Daniel Müllensiefen/Klaus Frieler Hamburg Anna Lubiw/Luke Tanur Waterloo, Canada Michael Clausen Bonn Alexandra Uitdenbogerd, RMIT Rainer Typke, Frans Wiering, Remco C. Veltkamp, Universiteit Utrecht, {rainer.typke|frans.wiering|remcov}@cs.uu.nl Jeremy Pickens - jeremy@dcs.kcl.ac.uk Tim Crawford - t.crawford@gold.ac.uk

The likelihood of entering is unknown for all groups except Utrecht and Goldsmiths/King's; for Utrecht and Goldsmiths/King's, it is high. The other groups have developed algorithms that would be interesting to compare, so their likelihood of entering is not too low.


Evaluation Procedures

1) Absolute measurement using the ground truth For each of the 11 queries from the ground truth, we have a certain number of groups of incipits, where we know the correct ranking of the groups. Let us assume that the total number of incipits in these groups is N. For each ranking returned by a method to be evaluated, take the top N incipits. At every border between groups, calculate the precision. To get one single measure, one can now integrate over this precision curve and divide the result by N. This measure will be a number between 0 and 1, where 0 means total failure and 1 means complete agreement with the ground truth. This measure will reflect not only the ability of algorithms to find as many relevant incipits as possible, but also to order them correctly.

For example, let us assume that the ground truth has delivered the following three groups, with N=6: (a,b),(c,d,e),(f).

Method X returns: b,a, g,c,d, e.

Since there are 3 groups, we determine precision and recall at three points:

after position 2: precision 1 (because both relevant documents a and b have been found. The correct order of a and b is not known since they are in the same group, therefore no penalty should be applied for them being in the opposite order).

after position 5: precision 4/5=0.8 (because only g should not be among the first 5 incipits)

after position 6: precision 5/6=0.83 (because only g should not be among the first 6)

For an illustration, see the last 3 slides of http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/dir05pres.pdf

2) Comparison of algorithms without using the ground truth To avoid the overfitting of the algorithms to the ground truth, select another group of queries after all algorithms have been submitted. Put all incipits returned by the algorithms into one pool. Remove doubles from the pool, divide the remaining incipits into portions of equal sizes, and distribute them back to the participants for relevance judgements (on a binary or ternary scale). Once the relevance is known for all returned incipits, use standard TREC measures for comparing the algorithms. This second comparison has some disadvantages: the complete set of relevant incipits won't be known, and neither will be the correct ranking. But the main disadvantage of the first comparison, the fact that the correct answer is known in advance, is avoided.


Relevant Test Collections

RISM A/II collection (large collection of real-world compositions for which a ground truth has already been established). Possible copyright issue: Permission from RISM for distributing the test collection or parts of it to participants and using it for EvalFest is probably necessary. Obtaining the data is not difficult: the RISM CD is available in good libraries, and the data can be exported from the CD in plaine&easie format. Alternatively, the existing result of converting the data into MusicXML could be distributed. If RISM does not agree to the data being distributed, maybe it is possible to convince them to at least agree to a solution similar to the Naxos audio collection, where the data are quarantained, and only the software travels.

Review 1

Melodic similarity would have to be precisely defined, i.e. how was the ground truth established.

The mentioned possible participants are really likely to participate (to my knowledge at least 10 out of the 14 listed).

Part 1 - Absolute measurement using ground truth Can the number of queries be increased? 11 queries for a collection of 50000 incipits seem rather small (This comment can be ignored if there is difficulty in establishing the ground truth from such a large collection). Part 2 - Comparison of algorithms without using the ground truth Any ideas of how queries would be obtained ? Real-world queries ? Queries pooled from each participant ?

The relevance assumptions have to be explicitly stated for the judging process by the various participants using this pooling approach. If ternary scale is used, details of this scale would be needed. What evaluation measure would be adopted from TREC?

Review 2

A very well- thought out proposal. First off, he has a database in mind, a large list of participants and thought carefully about evaluation. However, the number of likely participants is probably not as high as he believes and there are still issues with evaluation.

I'm not totally clear about some aspects. These 500000 notated real world compositions- are they audio files or midi? How long are they and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about midi.

His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.

His method of evaluation through use of a groundtruth established by music experts should be compared with Jeremy's suggestion (check previous MIREX postings or just communicate with him directly) of using variations to identify melodic similarity.

His section on evaluation is quite good, and precision/recall is the preferred method in the Info Retrieval community (just ask Stephen). Both algorithms seem feasible assuming the other issues have been worked out. If possible, both his suggested evaluation procedures should be implemented. Would be interesting to see if they give similar results, and if not, why? If the RISM A/II collection is really available from many libraries, then most of the work is done. Stephen's group can keep it quarantined and just run the algorithms that are provided. I think we don't even need to ask RISM for permission.

I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.

Downie's Comments

1. A classic retrieval task! I am game. Some hurdles to jump but I always thought RISM was a good place to start.

2. Do think Jeremy Pickens would have some good ideas to help this along.

3. I wonder if the Meldex folks would like to play?