2005:Audio Melody Extr
Graham Poliner (Columbia University) email@example.com
Melody Extraction of Polyphonic Audio
The melodic content of polyphonic audio provides an intuitive representation for summarization and retrieval. Numerous potential approaches exist for automated melody extraction; therefore, the MIREX 2005 Melody Extraction Evaluation seeks to compare the accuracy of state-of-the-art melody transcription algorithms. The evaluation data set will consist of an eclectic collection of audio excerpts along with the corresponding frame-based transcription of the dominant voice. The performance of the submitted algorithms will be evaluated based on the percentage of frames correctly transcribed.
- Juan P. Bello - firstname.lastname@example.org - Very Likely
- Ali Taylan Cemgil - email@example.com - Moderately Likely
- Emilia Gomez - firstname.lastname@example.org - Moderately Likely
- Masataka Goto - email@example.com - Moderately Likely
- Jana Eggink - firstname.lastname@example.org - Moderately Likely
- Anssi Klapuri - email@example.com - Moderately Likely
- Matija Marolt - firstname.lastname@example.org - Likely
- Rui Pedro Paiva - email@example.com - Very Likely
- Graham Poliner - firstname.lastname@example.org - Very Likely
- Sven Tappert - email@example.com - Very Likely
- Karin Dressler - firstname.lastname@example.org - Likely
- Matti Ryynänen - email@example.com - Moderately Likely
- Emmanuel Vincent - firstname.lastname@example.org - Likely
The options below follow the evaluation procedure specified for the ISMIR 2004 Melody Contest:
- Option 1 - A frame-based comparison between the predicted and reference melody
The total prediction accuracy may be computed by averaging the absolute pitch difference over all frames, where the maximal error per frame is defined as one semitone (100 cents) and a value of 0 Hz may be assigned to unvoiced segments.
- Option 2 - A frame-based comparison between the predicted and reference melody over a one-octave range
This option is the same as Option 1; however, the predicted melody and reference melody are mapped into the range of one octave before calculating the absolute difference.
- Option 3 - Edit distance between the estimated melody and the correct melody
Following the edit distance calculation outlined in Grachten et al. 2002
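As a rough illustration of Options 1 and 2, the sketch below computes a capped per-frame error in cents and turns the average into a percentage. Several details are assumptions not fixed by the proposal: the function names are hypothetical, a frame where only one of the two tracks is unvoiced receives the maximal error, and Option 2 is implemented by folding the pitch difference into one octave (rather than folding each pitch separately, which would penalize notes on either side of the octave boundary).

```python
import math


def hz_to_cents(f_hz, ref_hz=440.0):
    """Convert a frequency in Hz to cents relative to ref_hz."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def frame_error_cents(pred_hz, ref_hz, fold_octave=False, max_err=100.0):
    """Absolute pitch error for one frame, in cents, capped at max_err.

    0 Hz marks an unvoiced frame. Assumption: if both values are unvoiced
    the error is 0; if only one is unvoiced the error is max_err.
    """
    if pred_hz <= 0 and ref_hz <= 0:
        return 0.0
    if pred_hz <= 0 or ref_hz <= 0:
        return max_err
    diff = hz_to_cents(pred_hz) - hz_to_cents(ref_hz)
    if fold_octave:
        # Option 2 (assumed variant): fold the difference into
        # [-600, 600) cents so that exact octave errors count as correct.
        diff = (diff + 600.0) % 1200.0 - 600.0
    return min(abs(diff), max_err)


def accuracy_percent(pred, ref, fold_octave=False, max_err=100.0):
    """Average the capped frame errors and express the result as a percentage."""
    errs = [frame_error_cents(p, r, fold_octave, max_err)
            for p, r in zip(pred, ref)]
    return 100.0 * (1.0 - (sum(errs) / len(errs)) / max_err)


pred = [440.0, 880.0, 0.0, 0.0]
ref = [440.0, 440.0, 0.0, 220.0]
option1 = accuracy_percent(pred, ref)                    # octave error penalized
option2 = accuracy_percent(pred, ref, fold_octave=True)  # octave error forgiven
```

In this toy input the second frame is an octave error, so it lowers the Option 1 score but not the Option 2 score.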
Relevant Test Collections
For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions. Due to the success of the ISMIR 2004 Melody Competition, we recommend that the evaluation set be reused and augmented with additional audio excerpts from such genres as pop, jazz, digital, and opera. The new ground truth may be created by manually correcting the output of current melody transcription algorithms. We may also wish to consider representing the genres in different proportions for the MIREX 2005 evaluation. The inclusion of popular music may result in additional copyright issues. Copyright law prohibits the universal or unlimited distribution of material on the web. However, if access to the media is limited to MIREX participants, this should be considered a fair use of the copyrighted materials.
Problem is reasonably well defined and would be considered interesting in terms of current research.
No mention of audio format/sampling rate, will assume:
- CD-quality (PCM, 16-bit, 44100 Hz)
- 30-second excerpts
- files are named as "001.wav" to "999.wav"
No mention of frame size or hop size. Will these be the same as in the 2004 competition (frame size 2048, hop size 256)? Is this optimal? Would some participants prefer to use different sizes? Could the proposed evaluation metrics be modified to use absolute time indexes and a tolerance, and therefore be independent of framing?
In the proposed evaluation metrics there is no mention of whether Option 1 and Option 2 will be averaged as they were last year, or how Option 3 will be combined with these. Statistical significance of differences between submissions should be estimated.
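One possible way to estimate significance, given per-excerpt scores for two submissions on the same files, is a paired sign-flip permutation test. The sketch below is only an illustration: the review does not name a test, the function name is hypothetical, and exhaustive enumeration is practical only for a small number of excerpts (for larger sets, random sign flips would have to be sampled instead).

```python
from itertools import product


def paired_permutation_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired scores.

    Under the null hypothesis the sign of each per-excerpt difference is
    arbitrary, so every sign pattern is enumerated and the fraction whose
    absolute summed difference is at least the observed one is returned.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical per-excerpt accuracies (percent) for two submissions.
p_value = paired_permutation_pvalue([80, 75, 90, 85], [70, 65, 80, 75])
```

With only four excerpts the smallest attainable two-sided p-value is 2/16, which shows why the size of the test set matters for this kind of comparison.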
Re-use and augmentation of last year's database is fine; however, there is no mention of where new data will come from. Obviously the Magnatune database would be a good source, as it can also be distributed, but it may be best to distribute last year's database and hold back the new examples. How big should the new database be? 50 files? I assume there are likely to be no trained submissions, or that they will be pre-trained, so a single pass over the data should be fine. There is also no mention of how many non-participating transcribers will produce the ground truth and how differences between transcriptions will be resolved. Given the IP status of the Magnatune database, distribution to transcribers should not be a problem.
Given the high number of potential participants, I think we can be confident of sufficient participation to run the evaluation.
Recommendation: Significant refinements to proposal and accept.
This problem is well defined and very relevant to MIR.
The possible participants mentioned are all active in the field. However, the participants marked as "very likely" are the same people who participated last year, while some key researchers in the field are modestly marked as "moderately likely". I believe that for this evaluation to be meaningful, the organizers should secure the participation of Masataka Goto (whose PreFEst algorithm is still the main reference for melody extraction), Matija Marolt, Jana Eggink (both of whom published relevant work last year) and Anssi Klapuri (who has an extensive research record on relevant issues). Also, apart from Ali Taylan Cemgil, some of the people working on more Bayesian approaches to relevant problems are not mentioned: Chris Raphael (Indiana U), Samer Abdallah (Queen Mary, London), Randall Leistikow (Stanford U), Kunio Kashino (NTT Japan). It could be very interesting to have them on board.
Regarding evaluation procedures, this contest has the advantage of having a precedent during last year's exercise. I would make a few suggestions from that experience:
- UPF should make available any semi-automatic tool for evaluation used last year.
- Each sound file to be used should be cross-annotated, and the variability between annotations should be used for the evaluation.
- Arrangements with two or more voices should be eliminated from the training/test set. In those there is no clear definition of the melody to be extracted.
- There should be a separate evaluation for melody segmentation: how well the algorithm separates those excerpts containing melodic parts from those that are purely background. The evaluation can be similar to the one in Marolt's paper for DAFx04.
I would recommend that the organizers contact Emilia Gomez, Sebastian Streich and Bee-Suan Ong from UPF about last year's experience. We should learn from that experience and improve where necessary.
Using the RWC database, Magnatune and other similar collections could help to expand the training and test sets. The organizers will need to coordinate a wide effort to expand on the currently existing contest database. Melody annotation is very complex and quite time-consuming, so only through a concerted effort will a proper test set be developed. The organizers could also contact Micheline Lesaffre in Ghent about their annotation efforts in the past (see ISMIR 2004).
1. The reviewers have summed up the issues very well. This is a hard task to evaluate completely and well. Can we come up with a "baby" version that we can do now while aiming toward a richer evaluation down the road?
As a potential participant, I have two comments.
- How can we measure the performance of an algorithm regarding fine identification of f0 if the target f0 is created with another algorithm? This is not a ground truth! I would rather use the following error for option 1: the error is equal to 0 whenever the predicted f0 is within 1/4 tone of the reference f0, and equal to 1 otherwise. This also solves the frame size issue, since the reference f0 may vary slightly depending on the frame size but the discrete pitch does not. Another possibility would be to consider prediction of discrete (MIDI) pitch, which is sufficient for MIR applications and relevant as long as all excerpts have the same reference pitch of 440 Hz (no ancient music then). Discrete events are needed anyway to compute the edit distance, aren't they? (Please insert an http link to the article describing the calculation of this distance.)
- The distinction between voiced/unvoiced (melody/accompaniment) segments is not very clear: in my opinion, when the main melody is silent for a while, you hear another melody inside the accompaniment. Last year melody was defined using training data from the same musical excerpts as the test data, but this is not a good idea since it may lead to learning data-specific melody characteristics. I would prefer to use excerpts containing only clearly voiced portions and/or to define melody by its pitch range ("if the dominant pitch is between A and B then it is part of the melody"), so that no training set is needed to define melody.
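The binary error suggested in the first comment, and the alternative of discrete MIDI pitch, can be sketched as follows. A quarter tone is taken as half a semitone (50 cents) and MIDI note 69 as A4 = 440 Hz. The function names and the handling of 0 Hz frames (counted as a hit only when both tracks are unvoiced) are assumptions made for this illustration.

```python
import math


def hz_to_midi(f_hz, a4_hz=440.0):
    """Continuous MIDI pitch for a frequency, with A4 (note 69) at a4_hz."""
    return 69.0 + 12.0 * math.log2(f_hz / a4_hz)


def is_hit(pred_hz, ref_hz, tol_semitones=0.5):
    """True when the prediction lies within a quarter tone of the reference.

    Assumption: 0 Hz marks an unvoiced frame, and such a frame is a hit
    only if both prediction and reference are unvoiced.
    """
    if pred_hz <= 0 or ref_hz <= 0:
        return pred_hz <= 0 and ref_hz <= 0
    return abs(hz_to_midi(pred_hz) - hz_to_midi(ref_hz)) <= tol_semitones


def hit_rate(pred, ref, tol_semitones=0.5):
    """Fraction of frames with error 0 under the binary error measure."""
    hits = sum(1 for p, r in zip(pred, ref) if is_hit(p, r, tol_semitones))
    return hits / len(ref)


def quantize_to_midi(f_hz, a4_hz=440.0):
    """Nearest discrete MIDI note, or None for an unvoiced (0 Hz) frame."""
    return round(hz_to_midi(f_hz, a4_hz)) if f_hz > 0 else None


rate = hit_rate([440.0, 440.0, 0.0], [445.0, 470.0, 0.0])
```

Because the measure only asks whether the prediction falls inside the window, small differences in the reference f0 caused by the choice of frame size do not change the score.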
There should be an option to use different hop/frame sizes. Maybe a preferred size could be given (i.e. the one used for the ground truth), while for other sizes the ground truth data could be interpolated to fit any hop size (any loss of accuracy is at the submitter's risk).
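A minimal sketch of such an interpolation is given below, with hop sizes expressed in samples (for example the hop size of 256 mentioned for the 2004 competition). The function name is hypothetical, and two choices are assumptions: interpolation is linear in Hz, and no interpolation is done across a voiced/unvoiced boundary (the nearest reference frame is used instead, so that 0 Hz values are never blended with real pitches).

```python
def resample_f0_track(ref_hz, ref_hop, new_hop):
    """Resample a frame-based f0 track from ref_hop to new_hop (in samples).

    Frame k of the output sits at sample k * new_hop. Between two voiced
    reference frames the f0 is interpolated linearly; if either neighbour
    is unvoiced (0 Hz), the nearer reference frame is copied.
    """
    if not ref_hz:
        return []
    last = len(ref_hz) - 1
    n_new = (last * ref_hop) // new_hop + 1
    out = []
    for k in range(n_new):
        pos = k * new_hop / ref_hop      # position in reference frames
        i = min(int(pos), last)          # left neighbour
        j = min(i + 1, last)             # right neighbour
        frac = pos - i
        a, b = ref_hz[i], ref_hz[j]
        if a > 0 and b > 0:
            out.append(a + frac * (b - a))
        else:
            out.append(a if frac < 0.5 else b)
    return out


reference = [100.0, 200.0, 0.0, 0.0]                # hop size 256
finer = resample_f0_track(reference, 256, 128)      # submitter uses hop 128
coarser = resample_f0_track(reference, 256, 512)    # submitter uses hop 512
```

Interpolating in log-frequency (cents) rather than in Hz would be an equally plausible choice; for neighbouring frames of a smooth melody the difference is small.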
Last year's data should be augmented with some new data. Besides the sources already mentioned, RWC is a useful source, as MIDI transcriptions are also available (although not aligned) and may provide a starting point for annotation. UPF's tool would certainly be useful. Are there any score-to-audio alignment tools available?
I agree that we could have several evaluations:
- f0 without taking into consideration unvoiced/accompaniment parts, thereby ignoring algorithm's capability of separating melody from other parts (considering and ignoring octave errors) and emphasizing f0 detection
- f0 as last year (considering and ignoring octave errors)
- melody segmentation, as proposed by reviewer 2, but this would also mean that ground truth should include accompaniment, which is probably not realistic
- edit distance ?
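The first and third evaluations in this list can be separated roughly as sketched below: pitch accuracy is measured only on frames where the reference is voiced (with or without octave errors), and the melodic/non-melodic decision is scored on its own. The function names, the 50-cent tolerance, and the convention that 0 Hz marks an unvoiced frame are assumptions made for the example.

```python
import math


def voiced_pitch_accuracy(pred, ref, tol_cents=50.0, ignore_octave=False):
    """Fraction of reference-voiced frames with f0 within tol_cents.

    Frames where the reference is unvoiced (0 Hz) are skipped entirely,
    so the score does not reflect melody/accompaniment separation.
    """
    hits = 0
    total = 0
    for p, r in zip(pred, ref):
        if r <= 0:
            continue
        total += 1
        if p <= 0:
            continue  # no pitch reported for a voiced frame: a miss
        diff = 1200.0 * math.log2(p / r)
        if ignore_octave:
            diff = (diff + 600.0) % 1200.0 - 600.0
        if abs(diff) <= tol_cents:
            hits += 1
    return hits / total if total else 0.0


def voicing_rates(pred, ref):
    """Return (recall, false_alarm) for the melodic/non-melodic decision."""
    n_voiced = sum(1 for r in ref if r > 0)
    n_unvoiced = len(ref) - n_voiced
    detected = sum(1 for p, r in zip(pred, ref) if r > 0 and p > 0)
    false_pos = sum(1 for p, r in zip(pred, ref) if r <= 0 and p > 0)
    recall = detected / n_voiced if n_voiced else 0.0
    false_alarm = false_pos / n_unvoiced if n_unvoiced else 0.0
    return recall, false_alarm


pred = [220.0, 880.0, 0.0, 300.0]
ref = [220.0, 440.0, 440.0, 0.0]
strict = voiced_pitch_accuracy(pred, ref)
octave_free = voiced_pitch_accuracy(pred, ref, ignore_octave=True)
recall, false_alarm = voicing_rates(pred, ref)
```

Note that scoring the voicing decision this way only needs a melodic/non-melodic label per frame, not a transcription of the accompaniment.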
If ground truth f0 is not estimated accurately enough, then some discretization scheme similar to Emmanuel's suggestions would be appropriate, but I disagree with just MIDI pitches, as they are too coarse, especially with vocal parts.
About annotating, nothing different from what was done last year:
- Get multi-track recordings and produce a monophonic wav file for the melody (say the voice track or the sax track) and a polyphonic wav file with the full mixture (a lot of people in this area have recordings of their own music or have access to recordings of other people's music, and should be willing to distribute 30 s segments of songs for research purposes).
- Run the monophonic track through a monophonic pitch estimator and use the results as a reference for people to "correct"/edit as they consider necessary. We could use some tools for that, like UPF's SMSTools.
About improving last year's test set:
- we need to eliminate ambiguous cases like the Beatles song (pop1) where vocal arrangements create more than one possible melody at times.
- We need cross-annotations: even using the above mentioned method for annotation, you cannot consider that data to be ground-truth. The proper thing will be to have a few people annotate the same music, so that we can use the variability of their annotations as a guide for comparison. There are two main things to be annotated: a) the pitch of the melody during melodic segments; b) the temporal boundaries of the melodic segments. We can measure variability of a) in Hz (or fractions of tones) and variability of b) in seconds.
- I agree with Emmanuel that an F0-deviation error is not the best choice given the nature of the ground truth. For every frame we should count a hit or a miss depending on whether the detection falls within a tolerance window around the annotated value. This tolerance window could be fixed (e.g. 1/4 tone) or variable (derived from the variability between annotations, as suggested by Leveau for the case of onsets, ISMIR 04). I am in favour of the second.
- I am not entirely sure about last year's 3rd evaluation (segmenting to MIDI values), mainly because for a lot of music melody is better described as a time-varying curve rather than as a sequence of segmented notes (e.g. for voice, sax, etc). In fact, when I did the experiments last year, "segmented" melodies sounded awful most of the time (except for fairly artificial cases like the midi files or one of the daisy samples).
- I think we could separately evaluate the temporal segmentation of melodic tracks (sections of the song containing melody). It is a problem of its own that deserves attention. It'll be interesting to see which approach gives you the best segmentation as opposed to which one gives you the best pitch estimation. For this we don't need annotation of accompaniment as segments will be classified simply as melodic or non-melodic.
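To make the idea of a variable tolerance window more concrete, the sketch below derives a per-frame reference and tolerance from several annotations of the same excerpt. Everything specific here is an assumption for illustration only: the reference is the median annotation (in cents), the tolerance is the largest deviation of any annotator from that median with a floor of 50 cents, all frames are assumed voiced, and the function names are hypothetical.

```python
import math
from statistics import median


def to_cents(f_hz, ref_hz=440.0):
    """Frequency in cents relative to ref_hz."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def reference_and_tolerance(annotations, min_tol_cents=50.0):
    """Per-frame (reference_cents, tolerance_cents) from cross-annotations.

    annotations: one f0 list (Hz) per annotator, all of equal length and,
    for simplicity, voiced in every frame.
    """
    result = []
    for frame in zip(*annotations):
        cents = [to_cents(f) for f in frame]
        centre = median(cents)
        spread = max(abs(c - centre) for c in cents)
        result.append((centre, max(spread, min_tol_cents)))
    return result


def variable_hit_rate(pred_hz, ref_tol):
    """Fraction of frames whose prediction falls inside its own window."""
    hits = sum(1 for p, (centre, tol) in zip(pred_hz, ref_tol)
               if abs(to_cents(p) - centre) <= tol)
    return hits / len(ref_tol)


# Three hypothetical annotators; they disagree by an octave in frame 2.
annotations = [[440.0, 440.0], [440.0, 880.0], [440.0, 440.0]]
windows = reference_and_tolerance(annotations)
rate = variable_hit_rate([445.0, 600.0], windows)
```

Where annotators agree, the window stays at the fixed quarter-tone floor; where they disagree, as in the second frame, the window widens and the frame effectively carries less weight in the comparison. The same approach could be applied to segment boundaries, with the spread measured in seconds.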
Emilia (UPF)'s comments
Just making some additional comments from last year's experience:
1.- "UPF should make available any semi-automatic tool for evaluation used last year". Regarding the annotation tools used last year, there was wavesurfer for melodic annotations: http://www.speech.kth.se/wavesurfer/ It is free software and may be used. Also SMSTools for fundamental frequency annotations, plus some manual corrections.
2.- On different voices: this can be one of the main difficulties in melody extraction, to determine which is the "predominant" melody. Maybe this problem can be solved with cross-annotations.
3.- On melody vs predominant pitch: the ability to find the melody was measured in evaluation metric number 3. In last year's competition, we found that most of the approaches only output a pitch envelope, not a melody. That takes us to the question: is pitch = melody?
Should we then consider the contest as "predominant pitch estimation" instead of "melody extraction"? Interesting and unanswered question ... :-)
4.- Related to pitch vs quantized pitch: it is true that we already use an algorithm for generating a ground truth, plus manual corrections. It was agreed by participants that quantized pitch would be too coarse. I think that, although there might be a metric related to quantized pitch, there should be an evaluation metric considering expressive variations of the pitch.