2006:Symbolic Melodic Similarity
Latest revision as of 12:06, 13 May 2010
Results
Results are on 2006:Symbolic Melodic Similarity Results page.
Overview
This page is devoted to discussion of the MIREX 2006 Symbolic Melodic Similarity contest. Discussions on the contest planning list will be briefly digested on this page. A full digest of the discussions is available to subscribers from the planning list archives.
Task suggestion: Symbolic Melodic Similarity
Proposed tasks
1. Retrieve the most similar incipits from the UK subset of the RISM A/II collection (about 15,000 incipits), given one of the incipits as a query, and rank them by melodic similarity. Both the query and the collection are monophonic. Half the queries are hummed or whistled queries that have been converted to MIDI, thus with slight rhythmic and pitch imperfections, and half the queries are quantized in pitch and rhythm.
2. Like task 1, but with two collections of mostly polyphonic MIDI files to be searched for matches. The queries would still be monophonic. The first collection would be 10,000 randomly picked MIDI files from a collection of about 60,000 MIDI files that were harvested from the Web. They include different genres (Western and Asian popular music, classical music, ringtones, just to name a few). The second collection would be more focused: about 1000 .kar files (Karaoke MIDI files) with mostly Western popular music which stem from the same web harvest.
Inputs/Outputs
Task 1:
Input parameters:
- the name of a directory containing about 15,000 MIDI files, mostly monophonic incipits, and
- the name of one MIDI file containing a monophonic query.
The program will be called 6 times. Three of the queries will be quantized (produced from symbolic notation) and three produced by humming or whistling, thus with slight rhythmic and pitch deviations.
Expected output: a list of the names of the 10 most similar matching MIDI files, ordered by melodic similarity. Write the file names on separate lines, with no empty lines in between.
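The calling convention above can be sketched as a small command-line program. The similarity function below is only a byte-level stand-in (a real submission would parse the MIDI note events and compare pitch and rhythm sequences); the ranking and output format follow the spec:

```python
import os
import sys

def similarity(query_bytes, candidate_bytes):
    # Placeholder metric: fraction of positions with identical bytes.
    # A real submission would parse MIDI events and compare melodies.
    n = min(len(query_bytes), len(candidate_bytes))
    if n == 0:
        return 0.0
    same = sum(1 for a, b in zip(query_bytes, candidate_bytes) if a == b)
    return same / n

def rank(collection_dir, query_path, top_n=10):
    # Score every file in the collection against the query and
    # return the top_n file names, most similar first.
    with open(query_path, "rb") as f:
        query = f.read()
    scored = []
    for name in os.listdir(collection_dir):
        with open(os.path.join(collection_dir, name), "rb") as f:
            scored.append((similarity(query, f.read()), name))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored[:top_n]]

if __name__ == "__main__" and len(sys.argv) == 3:
    # Called as: python rank.py <collection_dir> <query.mid>
    for name in rank(sys.argv[1], sys.argv[2]):
        print(name)  # one file name per line, no blank lines
```
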
Task 2:
Input: same interface as for task 1, thus the name of the directory with files to be searched and the name of the query. However, the directory will contain either about 10,000 mostly polyphonic MIDI files or 1000 Karaoke files.
Output: a list of 10 different MIDI file names that contain melodically similar musical material, ordered by similarity, plus, for each file, the time (offset from the beginning, in seconds) where the query match starts and where it ends. If the query matches in more than one position, return the position of the most similar match (or any one of them if there is a tie). If the algorithm does not align the query with the MIDI file at any particular position, return 0 as the start time and the duration of the MIDI file as the end time.
Sample output line:
somefile.mid,0,2.3
(means that somefile.mid matches the query, and the matching bit starts at the very beginning of the file and ends 2.3 seconds later). The most similar match should be returned first.
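For illustration, the output-line format could be produced and parsed as follows (the function names are hypothetical, not part of the spec):

```python
def format_match(filename, start_s, end_s):
    # One line per match: file name, match start and match end in seconds.
    return f"{filename},{start_s:g},{end_s:g}"

def parse_match(line):
    # rsplit from the right in case a file name itself contains commas.
    name, start, end = line.rsplit(",", 2)
    return name, float(start), float(end)
```
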
Building the ground truth
Unlike last year, it is now nearly impossible to manually build a proper ground truth in advance.
Because of that, after the algorithms have been submitted, their results will be pooled for every query, and every participant will be asked to judge the relevance of the matches for some queries. To make that a manageable burden, it is important that for task 2 the algorithms return not only the names of the matching MIDI files, but also where the matching bit starts and ends in each file. We can then automatically extract those matching bits into small new MIDI files whose relevance can be quickly checked.
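The pooling step described above might look like the following sketch, which merges the ranked lists round-robin by rank and removes duplicates so each candidate is judged only once (the exact pooling procedure used for the contest may differ):

```python
def pool_results(runs):
    # runs: one ranked result list per submitted algorithm.
    # Returns the union of all returned items, interleaved by rank
    # and deduplicated, for presentation to the human judges.
    pooled, seen = [], set()
    for depth in range(max(len(r) for r in runs)):
        for run in runs:
            if depth < len(run) and run[depth] not in seen:
                seen.add(run[depth])
                pooled.append(run[depth])
    return pooled
```
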
Measures
Use the same measures as [last year] to compare the search results of the various algorithms.
Potential participants
The following people have confirmed their interest:
- Klaus Frieler
- Nicola Orio
- Kjell Lemström
- Rainer Typke
- David Meredith
- Alexandra Uitdenbogerd
Nicola Orio
- Pooling among participants of the first M results of the retrieved files (as in the proposal), with a binary relevance, and if not binary at least quantized in a small number of classes (like 0-5 or 0-3).
- The ability to identify the part of the melody to be presented to the evaluator exists (and makes sense) only for local alignment approaches, not for approaches based on more general properties of the melodies. I'm afraid this difference in presenting the retrieval results will bias the assessments, and I suggest using complete melodies for the pooling of the results.
Kjell Lemström
- When evaluating the time that algorithms use for the task, the time spent by the system should be excluded from the algorithm running times (this was not the case last year).
Rainer Typke and Anna Pienimäki
- About pooling, we think it is more reasonable to use procedures similar to last year's when deciding on the amount of similarity between melodies, for technical, music-philosophical and comparability reasons.
- We think that when evaluating the melodies, the complete melodies should be made available, but also the matching snippets (if the algorithm can provide them). Asking the evaluators to judge songs of up to 5 minutes instead of relevant snippets is not practical. For instance, if we have n = 10 returned items per participant per query, m = 12 queries, an average song length of k = 3 minutes, and p = 4 participating algorithms, it takes m*n*k*p = 12*10*3*4 = 1440 minutes just to listen through all the candidate songs. With 10-second snippets, we would end up with an hour and 20 minutes, which is more reasonable.
- Algorithm running time: If the IMIRSEL team can make it available, this information would be interesting:
- time spent by the system
- time spent by the algorithm for indexing (if there is indexing)
- time spent by the algorithm for searching
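The back-of-envelope listening-time estimate in the pooling discussion above is easy to verify:

```python
def total_listening_minutes(queries, items_per_run, runs, minutes_per_item):
    # Worst case: every returned item is distinct and judged in full.
    return queries * items_per_run * runs * minutes_per_item

full_songs = total_listening_minutes(12, 10, 4, 3)      # 1440 minutes
snippets = total_listening_minutes(12, 10, 4, 10 / 60)  # 80 minutes
```
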
Alexandra Uitdenbogerd
Can I suggest that the simple n-gram method (modulo interval 5-gram coordinate matching) we submitted last year be used as a baseline for this and the other symbolic query track? I intentionally submitted our simplest effective technique for this purpose last year. Despite being the simplest method submitted, it was ranked 3rd out of 7 entries. I will probably be submitting something else this year, but would really like to see the n-gram technique used for continuity across MIREX years. It should give us an idea of whether we're progressing or just submitting different algorithms each year that do about the same.
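A rough sketch of this kind of n-gram technique follows. The interval representation is an assumption here (directed pitch intervals folded modulo 12, which loses some direction information); the original "modulo interval" encoding and scoring details may differ:

```python
def interval_ngrams(pitches, n=5, mod=12):
    # Directed pitch intervals between consecutive notes, folded
    # modulo 12 (an assumed encoding), collected as distinct n-grams.
    intervals = [(b - a) % mod for a, b in zip(pitches, pitches[1:])]
    return {tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1)}

def coordinate_match(query_pitches, doc_pitches, n=5):
    # Coordinate matching: count the distinct query n-grams that
    # also appear in the document.
    q = interval_ngrams(query_pitches, n)
    d = interval_ngrams(doc_pitches, n)
    return len(q & d)
```

A melody identical to the query shares all of its interval n-grams, so it scores highest; transposed copies also match, since only intervals are compared.
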
Rainer Typke
What do you mean by "using as a baseline"? A replacement for the ground truth? I don't think that would work all that well with n-grams, because this year we are working with hummed queries as well as exact ones (half the queries will have perfect rhythm and pitch, like last year, but the other half will be hummed; note that by "perfect rhythm and pitch", I don't mean that they are guaranteed to be error-free - for the non-RISM tasks, they are still entered by a human user). For the hummed and therefore distorted queries, n-grams would fail unless you do some quantization first, which could introduce errors.
So, since the nature of the task changed a bit, I am not sure if knowing how we perform compared to one particular algorithm will tell us much. That reference algorithm might just be better at last year's task than this year's, which might lead us to believe that we got better even if we didn't.
To really compare the results with last year, IMO, we would have to let all algorithms perform the same task as last year in addition to this year's harder tasks. This might actually be sensible since to my knowledge, last year's evaluation collection and queries were never made public, so the new algorithms can't be easily over-trained for last year's task.
I am copying these last two bits of discussion to the mailing list, to make sure that people who don't monitor this page can participate in the discussion.
Alexandra Uitdenbogerd
I certainly don't mean a replacement for ground truth.
I agree that it may not work with hummed queries. I haven't tried it. My n-gram technique was tested on melodies with errors, so it may be reasonably robust, but the errors were not transcribed-singing-type errors.
However, on reading this year's requirements, it wouldn't be possible to just run last year's code on this year's set.
Some may argue that their algorithms are designed for the new task and not the original one (or vice versa), so may object to running them on both collections. I would be happy for mine to run on both, out of curiosity.