2005:Audio Melody Extr

Proposer

Graham Poliner (Columbia University) graham@ee.columbia.edu

Title

Melody Extraction of Polyphonic Audio

Description

The melodic content of polyphonic audio provides an intuitive representation for summarization and retrieval. Numerous potential approaches exist for automated melody extraction; therefore, the MIREX 2005 Melody Extraction Evaluation seeks to compare the accuracy of state-of-the-art melody transcription algorithms. The evaluation data set will consist of an eclectic collection of audio excerpts along with the corresponding frame-based transcription of the dominant voice. The performance of the submitted algorithms will be evaluated based on the percentage of frames correctly transcribed.

Potential Participants

Juan P. Bello - juan.bello-correa@elec.qmul.ac.uk - Very Likely
Ali Taylan Cemgil - cemgil@science.uva.nl - Moderately Likely
Emilia Gomez - emilia.gomez@iua.upf.es - Likely
Masataka Goto - m.goto@aist.go.jp - Moderately Likely
Jana Eggink - j.eggink@dcs.shef.ac.uk - Moderately Likely
Anssi Klapuri - klap@cs.tut.fi - Moderately Likely
Matija Marolt - matija.marolt@fri.uni-lj.si - Likely
Rui Pedro Paiva - ruipedro@dei.uc.pt - Very Likely
Graham Poliner - graham@ee.columbia.edu - Very Likely
Sven Tappert - s_tappert@yahoo.de - Very Likely
Karin Dressler - dresslkn@idmt.fraunhofer.de - Likely
Matti Ryynänen - matti.ryynanen@tut.fi - Moderately Likely
Emmanuel Vincent - emmanuel.vincent@elec.qmul.ac.uk - Likely

Evaluation Procedures

The evaluation follows the procedure specified for the ISMIR 2004 Melody Contest, with three candidate options:

Option 1 - A frame-based comparison between the predicted and reference melody

The total prediction accuracy may be computed by calculating the average absolute difference for each frame where a maximal error is defined as one semitone = 100 cents and a value of 0 Hz may be assigned to unvoiced segments.
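One possible reading of Option 1 can be sketched as follows. The function names, the treatment of a voicing mismatch as a maximal error, and the conversion of the mean error into a percentage are assumptions made for illustration; the proposal itself only fixes the 100-cent ceiling and the 0 Hz convention for unvoiced frames.

```python
import math

def frame_error_cents(f_pred, f_ref, max_error=100.0):
    """Absolute pitch error of one frame in cents, capped at max_error."""
    # Both frames unvoiced (0 Hz): counted as a correct frame (assumption).
    if f_pred <= 0 and f_ref <= 0:
        return 0.0
    # Voicing mismatch: counted as a maximal error (assumption).
    if f_pred <= 0 or f_ref <= 0:
        return max_error
    cents = 1200.0 * abs(math.log2(f_pred / f_ref))
    return min(max_error, cents)

def accuracy_option1(pred, ref, max_error=100.0):
    """Mean capped error over all frames, expressed as a percentage score."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and of equal length")
    errors = [frame_error_cents(p, r, max_error) for p, r in zip(pred, ref)]
    mean_error = sum(errors) / len(errors)
    return 100.0 * (1.0 - mean_error / max_error)
```

With this definition, a transcription that is perfect on half of the frames and off by a semitone or more on the other half scores 50.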

Option 2 - A frame-based comparison between the predicted and reference melody over a one-octave range

This option is the same as Option 1; however, the predicted melody and reference melody are mapped into the range of one octave before calculating the absolute difference.
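A minimal sketch of the octave mapping, under the assumption that "mapped into the range of one octave" means comparing pitch classes with a circular distance (so that a predicted B just below a reference C counts as 100 cents, not 1100). The function name and the voicing rules are hypothetical and mirror the Option 1 conventions.

```python
import math

def folded_error_cents(f_pred, f_ref, max_error=100.0):
    """Capped per-frame error in cents after folding both pitches into one octave."""
    # Voicing handling is an assumption: 0 Hz marks an unvoiced frame.
    if f_pred <= 0 and f_ref <= 0:
        return 0.0
    if f_pred <= 0 or f_ref <= 0:
        return max_error
    # Signed difference in cents, reduced modulo one octave (1200 cents).
    diff = (1200.0 * math.log2(f_pred / f_ref)) % 1200.0
    # Circular distance: take the shorter way around the octave.
    diff = min(diff, 1200.0 - diff)
    return min(max_error, diff)
```

Under this reading, octave errors are forgiven entirely: a prediction of 880 Hz or 220 Hz against a 440 Hz reference yields zero error.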

Option 3 - Edit distance between the estimated melody and the correct melody

The edit distance is calculated as outlined in Grachten et al. 2002.
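For orientation only, the sketch below computes a plain Levenshtein distance with unit costs over sequences of discrete pitches. It is not the Grachten et al. formulation, whose operations and cost weights are not reproduced in this proposal; the function name and the use of MIDI note numbers as symbols are assumptions.

```python
def edit_distance(estimated, reference):
    """Unit-cost Levenshtein distance between two note sequences."""
    # prev[j] holds the distance between the processed prefix of
    # `estimated` and the first j symbols of `reference`.
    prev = list(range(len(reference) + 1))
    for i, est_note in enumerate(estimated, start=1):
        curr = [i]
        for j, ref_note in enumerate(reference, start=1):
            substitution = 0 if est_note == ref_note else 1
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + substitution)) # match or substitution
        prev = curr
    return prev[-1]
```

A frame-level pitch track would first have to be segmented into notes before such a distance can be applied, which is itself a design decision the proposal leaves open.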

Relevant Test Collections

For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions. Due to the success of the ISMIR 2004 Melody Competition, we recommend that the evaluation set be reused and augmented with additional audio excerpts from such genres as pop, jazz, digital, and opera. The new ground truth may be created by manually correcting the output of current melody transcription algorithms. We may also wish to consider representing the genres in different proportions for the MIREX 2005 evaluation. The inclusion of popular music may result in additional copyright issues. Copyright law prohibits the universal or unlimited distribution of material on the web. However, if access to the media is limited to MIREX participants, this should be considered a fair use of the copyrighted materials.

Review 1

Problem is reasonably well defined and would be considered interesting in terms of current research.

No mention of audio format/sampling rate, will assume:

CD-quality (PCM, 16-bit, 44100 Hz)
mono
30-second excerpts
files are named as "001.wav" to "999.wav"

No mention of frame size or hop size. Will this be the same as in the 2004 competition (frame size 2048, hop size 256)? Is this optimal? Would some participants prefer to use different sizes? Could the proposed evaluation metrics be modified to use absolute time indexes and a tolerance, and therefore be independent of framing?

In the proposed evaluation metrics there is no mention of whether Option 1 and Option 2 will be averaged as they were last year, or how Option 3 will be combined with these. Statistical significance of differences between submissions should be estimated.

Re-use and augmentation of last year's database is fine; however, there is no mention of where new data will come from. Obviously the Magnatune database would be a good source, as this can also be distributed; however, it may be best to distribute last year's database and hold back new examples. How big should the new database be? 50 files? I assume there are likely to be no trained submissions, or they will be pre-trained, so a single pass over the data should be fine. There is also no mention of how many non-participating transcribers will produce the ground truth and how differences in transcriptions will be resolved. Given the IP status of the Magnatune database, distribution to transcribers should not be a problem.

Given the high number of potential participants, I think we can be confident of sufficient participation to run the evaluation.

Recommendation: Significant refinements to proposal and accept.

Review 2

This problem is well defined and very relevant to MIR.

The possible participants mentioned are really working in the field. However, the participants marked as "very likely" are the same people that participated last year, while some key researchers in the field are modestly marked as "moderately likely". I believe that for this evaluation to be meaningful, the organizers should secure the participation of Masataka Goto (whose PreFEst algorithm is still the main reference for melody extraction), Matija Marolt, Jana Eggink (both of whom published relevant work last year) and Anssi Klapuri (who has an extensive research record on relevant issues). Also, apart from Ali Taylan Cemgil, some of the people working on more Bayesian-based approaches to relevant problems are not mentioned: Chris Raphael (Indiana U), Samer Abdallah (Queen Mary, London), Randall Leistikow (Stanford U), Kunio Kashino (NTT Japan). It could be very interesting to have them on board.

Regarding evaluation procedures, this contest has the advantage of having a precedent during last year's exercise. I would make a few suggestions from that experience:

UPF should make available any semi-automatic tool for evaluation used last year.
Each sound file to be used should be cross-annotated, and the variability between annotations should be used for the evaluation.
Arrangements with two or more voices should be eliminated from the training/test set. In those there is no clear definition of the melody to be extracted.
There should be a separate evaluation for melody segmentation: how well the algorithm separates those excerpts containing melodic parts from those that are purely background. The evaluation can be similar to the one in Marolt's paper for DAFx04.

I would recommend that the organizers contact Emilia Gomez, Sebastian Strecht and Bee-Suan Ong from UPF about last year's experience. We should learn from that experience and improve where necessary.

Using the RWC database, Magnatune and other similar collections could help to expand the training and test sets. The organizers will need to coordinate a wide effort to expand on the currently existing contest database. Melody annotation is very complex and quite time-consuming, so only through a concerted effort will a proper test set be developed. The organizers could also contact Michele Lessaffre in Ghent about their annotation efforts in the past (see ISMIR 2004).

Downie's Comments

1. The reviewers have summed up the issues very well. This is a hard task to evaluate completely and well. Can we come up with a "baby" version that we can do now while aiming toward a richer evaluation down the road?

Emmanuel's Comments

As a potential participant, I have two comments.

How can we measure the performance of an algorithm regarding fine identification of f0 if the target f0 is created with another algorithm? This is not a ground truth! I would rather use the following error for Option 1: the error is equal to 0 whenever the predicted f0 is within 1/4 tone of the reference f0, and equal to 1 otherwise. This also solves the frame size issue, since the reference f0 may vary slightly depending on the frame size but the discrete pitch does not. Another possibility would be to consider prediction of discrete (MIDI) pitch, which is sufficient for MIR applications and relevant as long as all excerpts have the same reference pitch of 440 Hz (no ancient music then). Discrete events are needed anyway to compute the edit distance, aren't they? (Please insert an HTTP link to the article describing the calculation of this distance.)
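The two suggestions in this comment can be sketched as below. The 1/4 tone tolerance is taken as 50 cents and the 440 Hz reference comes from the comment; the function names and the handling of unvoiced (0 Hz) frames are assumptions added for illustration.

```python
import math

def binary_frame_error(f_pred, f_ref, tolerance_cents=50.0):
    """Return 0 if the prediction is within a quarter tone of the reference, else 1."""
    # Voicing handling is an assumption: 0 Hz marks an unvoiced frame.
    if f_pred <= 0 and f_ref <= 0:
        return 0
    if f_pred <= 0 or f_ref <= 0:
        return 1
    deviation = abs(1200.0 * math.log2(f_pred / f_ref))
    return 0 if deviation <= tolerance_cents else 1

def hz_to_midi(f_hz, a4_hz=440.0):
    """Continuous MIDI pitch for a frequency, with A4 = 69 at a4_hz."""
    return 69.0 + 12.0 * math.log2(f_hz / a4_hz)

def discrete_pitch(f_hz, a4_hz=440.0):
    """Nearest MIDI note number, the discrete alternative mentioned above."""
    return round(hz_to_midi(f_hz, a4_hz))
```

Note that the two variants are not equivalent: two frequencies less than 50 cents apart can still round to different MIDI notes when they straddle a note boundary.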

The distinction between voiced/unvoiced (melody/accompaniment) segments is not very clear: in my opinion, when the main melody is silent for a while, you hear another melody inside the accompaniment. Last year melody was defined using training data from the same musical excerpts as the test data, but this is not a good idea since it may lead to learning data-specific melody characteristics. I would like to use excerpts containing only clearly voiced portions and/or to define melody by its pitch range ("if the dominant pitch is between A and B then it is part of the melody"), so that no training set is needed to define melody.

Matija's Comments

Some comments:

There should be an option to use different hop/frame sizes. Maybe a preferred size could be given (i.e. the one used for the ground truth), while for others the ground truth data could be interpolated to fit any hop size (any loss of accuracy is at the submitter's risk).
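A minimal sketch of such an interpolation, assuming the ground truth is a list of f0 values in Hz at a fixed hop with 0 Hz marking unvoiced frames. The function name is hypothetical, and the choice to interpolate linearly only between two voiced frames (falling back to the nearest frame at voicing boundaries) is an assumption, made so that no spurious pitch glides toward 0 Hz are created.

```python
def resample_reference(ref_f0, ref_hop_s, new_hop_s):
    """Resample a frame-based f0 reference to a different hop size (in seconds)."""
    if not ref_f0:
        return []
    last = len(ref_f0) - 1
    duration = last * ref_hop_s
    # Small epsilon guards against floating-point rounding at the final frame.
    n_new = int(duration / new_hop_s + 1e-9) + 1
    out = []
    for k in range(n_new):
        pos = k * new_hop_s / ref_hop_s   # position in units of reference frames
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        a, b = ref_f0[lo], ref_f0[hi]
        if a > 0 and b > 0:
            # Both neighbours voiced: linear interpolation in Hz.
            out.append(a + frac * (b - a))
        else:
            # Voicing boundary: copy the nearest reference frame.
            out.append(a if frac < 0.5 else b)
    return out
```

For example, a reference at a 0.5 s hop can be resampled to 0.25 s with `resample_reference([100.0, 200.0, 0.0], 0.5, 0.25)`; real hop sizes would be far smaller, on the order of a few milliseconds.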

Last year's data should be augmented with some new data; next to mentioned sources, RWC is a useful source, as MIDI transcriptions are also available (although not aligned) and may provide a starting point for annotation. UPF's tool would certainly be useful. Are there any score-to-audio alignment tools available?

I agree that we could have several evaluations:

f0 without taking into consideration unvoiced/accompaniment parts, thereby ignoring algorithm's capability of separating melody from other parts (considering and ignoring octave errors) and emphasizing f0 detection
f0 as last year (considering and ignoring octave errors)
melody segmentation, as proposed by reviewer 2, but this would also mean that ground truth should include accompaniment, which is probably not realistic
edit distance ?

If ground truth f0 is not estimated accurately enough, then some discretization scheme similar to Emmanuel's suggestions would be appropriate, but I disagree with just MIDI pitches, as they are too coarse, especially with vocal parts.
