2008:Audio Melody Extraction
[this page is for now a pale copy/paste of MIREX06 webpage: Audio_Melody_Extraction]
Contents
- 1 Goal
- 2 Description
- 3 Potential Participants
- 4 JL's Comments 11/07/08
- 5 Vishu's comments 14/07/08
- 6 JL's Comments 15/07/08
- 7 Vishu's comments : Multi-track Audio available 22/07/08
- 8 Karin's comments 22/07/08
- 9 JL's Comments 30/07/08
- 10 Vishu's Comments 04/08/08
- 11 JL's Comments 05/08/08
- 12 Mert's Comments 05/08/08
- 13 Vishu's Comments 07/08/08
- 14 JL's Comments 11/08/08
- 15 Vishu's comments 12/08/08
Goal
To extract the melody line from polyphonic audio.
The deadline for this task is AUGUST 22nd.
Description
The aim of the MIREX audio melody extraction evaluation is to identify the melody pitch contour from polyphonic musical audio. The task consists of two parts: voicing detection (deciding whether a particular time frame contains a "melody pitch" or not), and pitch detection (deciding the most likely melody pitch for each time frame). We structure the submission to allow these parts to be done independently, i.e. it is possible (via a negative pitch value) to guess a pitch even for frames that were judged unvoiced. Algorithms that do not discriminate between melodic and non-melodic parts are also welcome!
(The audio melody extraction evaluation will essentially be a re-run of last year's contest, i.e. the same test data is used.)
Dataset:
- MIREX05 database : 25 phrase excerpts of 10-40 sec from the following genres: Rock, R&B, Pop, Jazz, Solo classical piano
- ISMIR04 database : 20 excerpts of about 20s each
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono)
- manually annotated reference data (10 ms time grid)
Output Format:
- In order to allow for generalization among potential approaches (i.e. frame size, hop size, etc), submitted algorithms should output pitch estimates, in Hz, at discrete instants in time
- each line of the output file therefore contains, in order: the time stamp, [space or tab], the corresponding frequency value, [new line]
- the time grid of the reference file is 10 ms, yet the submission may use a different time grid as output (for example 5.8 ms)
- Instants identified as unvoiced (there is no dominant melody) can be reported either as 0 Hz or as a negative pitch value. If negative pitch values are given, the statistics for Raw Pitch Accuracy and Raw Chroma Accuracy may be improved.
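The output format above can be sketched as follows. This is only a minimal illustration, not official MIREX tooling: the function names `format_melody_lines` and `parse_melody_line` and the number of decimals are assumptions made for the example. A frame judged unvoiced keeps its pitch guess with a negative sign, or 0 Hz if there is no guess.

```python
def format_melody_lines(times, freqs_hz, voiced):
    """Return one 'time<TAB>frequency' line per frame.

    Unvoiced frames carry their pitch guess with a negative sign,
    or 0 Hz when no guess is available.
    """
    lines = []
    for t, f, v in zip(times, freqs_hz, voiced):
        magnitude = abs(f)
        # Avoid emitting "-0.000" for unvoiced frames without a guess.
        value = magnitude if (v or magnitude == 0.0) else -magnitude
        lines.append(f"{t:.6f}\t{value:.3f}")
    return lines


def parse_melody_line(line):
    """Return (time_s, pitch_estimate_hz, is_voiced) for one output line."""
    t_str, f_str = line.split()  # accepts a space or a tab as separator
    t, f = float(t_str), float(f_str)
    return t, abs(f), f > 0.0


lines = format_melody_lines(
    [0.0, 0.01, 0.02],
    [220.0, 0.0, 221.5],
    [True, False, False],
)
```

Here the third frame is written as a negative value: it is declared unvoiced, but the 221.5 Hz guess is still available to the pitch statistics.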
Relevant Test Collections
- For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions, including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI. (full test set with the reference transcriptions (28.6 MB))
- Graham's collection: you find the test set here and further explanations on the pages http://www.ee.columbia.edu/~graham/mirex_melody/ and http://labrosa.ee.columbia.edu/projects/melody/
Potential Participants
- Jean-Louis Durrieu (TELECOM ParisTech, formerly ENST), durrieu@enst.fr
- Pablo Cancela (pcancela@gmail.com)
- Vishweshwara Rao (Indian Institute of Technology), vishu_rao@iitb.ac.in
- Karin Dressler (kadressler@gmail.com)
- Matti Ryynänen and Anssi Klapuri (Tampere University of Technology), matti.ryynanen <at> tut.fi, anssi.klapuri <at> tut.fi
- Chuan Cao and Ming Li (ThinkIT Lab., IOA), ccao <at> hccl.ioa.ac.cn, mli <at> hccl.ioa.ac.cn
JL's Comments 11/07/08
We propose to re-run the Audio Melody Extraction task this year. It was dropped last year, but since 2006 there has probably been further research on this topic. Anyone interested?
Vishu's comments 14/07/08
May I also suggest that we additionally have a separate evaluation for cases where the main melody is carried by the human singing voice as opposed to other musical instruments? I ask this for two reasons, the first being that for most popular music the melody is indeed carried by the human voice. The second reason is that, while our predominant F0 detector is quite generic, our voicing detector is 'tuned' to the human voice and so is less likely to perform well for other instruments.
JL's Comments 15/07/08
Concerning the vocal/non-vocal distinction: this has been done in previous evaluations of audio melody extraction (see https://www.music-ir.org/mirex/2006/index.php/Audio_Melody_Extraction_Results for the results of the MIREX06 task). I guess separate results for vocal and vocal+non-vocal should be possible once again.
I had another concern: does anyone know of an extra corpus? It would be nice to have some more material to test the algorithms. Maybe some more classical excerpts? Does anyone know a way to obtain such data, I mean, with a separate track of the main melody so that half of the work can be done by some automatic algorithm?
Vishu's comments : Multi-track Audio available 22/07/08
We are in possession of about 4 min 15 sec of Indian classical vocal performances with separated tracks of the main melody. For a 10 ms hop, there are about 21000 vocal frames. Would this data be of interest?
Karin's comments 22/07/08
Hi Vishu and others! Any new data is appreciated - and a classical Indian performance would definitely add an interesting new genre :-) I have only made minor changes to my own melody extraction algorithm since I have shifted my priorities to MIDI note estimation (onset/offset and tone height) of the melody voice. Anyway, I am interested in a new evaluation of my algorithm. I know that the ISMIR 2004 dataset has annotated MIDI notes available. Maybe we could also evaluate the extracted MIDI melody notes - at least for this data set! Is there anyone else interested in this evaluation?
JL's Comments 30/07/08
Hi everyone!
A few comments...
To Vishu: could you upload anything to Mert? I would also like to know how you annotated the data. The people who did the groundtruth for ISMIR2004 (E. Gomez in particular) told me that they used 46.44 ms long windows (for a 44.1 kHz sampling rate, that's 2048 samples, hence the "strange" number), with a 5.8 ms hopsize. This groundtruth has been modified by Andreas (Ehmann) such that the hopsize became 10 ms in MIREX05.
The groundtruth for both collections gives as the first column the time stamp of the _center_ of the window (at least, that's what they did for ISMIR04), and as the second column the corresponding frequency in Hz.
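As a quick check of the "strange" numbers above, the window duration follows directly from the sample count and the sampling rate. The 256-sample hop is an assumption made for this sketch (it is the power of two that matches the quoted 5.8 ms); only the 2048-sample window is stated in the discussion.

```python
SAMPLE_RATE_HZ = 44100
WINDOW_SAMPLES = 2048   # stated for the ISMIR04 groundtruth
HOP_SAMPLES = 256       # assumption: consistent with the quoted 5.8 ms hop

window_ms = 1000.0 * WINDOW_SAMPLES / SAMPLE_RATE_HZ
hop_ms = 1000.0 * HOP_SAMPLES / SAMPLE_RATE_HZ

print(f"window: {window_ms:.2f} ms, hop: {hop_ms:.1f} ms")
```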
To Karin: It's nice to see former participants coming back and risking their algorithms on the same task! I think that's also rather important for further studies: that way, we can directly compare ourselves to the state of the art!
Vishu's Comments 04/08/08
Sorry for the delay, but I was travelling for a bit. I just uploaded our data to Mert. The ground truth format is the same as MIREX05, except that instead of every 10 ms we generate ground truth values every 10.023 ms. This is because our data is sampled at 22.05 kHz, and 10 ms corresponds to 220.5 samples at that sampling frequency. So this had to be rounded to a hop of 221 samples (10.023 ms).
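The hop arithmetic above can be reproduced in a few lines. This is just a sketch: the helper name `rounded_hop` is hypothetical, and rounding half up is assumed because it matches the 221 samples quoted (Python's built-in `round()` would return 220 for 220.5).

```python
import math


def rounded_hop(sample_rate_hz, target_hop_ms):
    """Return (hop in samples, actual hop in ms) for a target hop in ms."""
    exact_samples = sample_rate_hz * target_hop_ms / 1000  # 220.5 here
    # Round half up; round(220.5) would give 220 (round-half-to-even).
    hop_samples = math.floor(exact_samples + 0.5)
    return hop_samples, 1000.0 * hop_samples / sample_rate_hz


hop_samples, actual_hop_ms = rounded_hop(22050, 10)
print(hop_samples, f"{actual_hop_ms:.3f} ms")
```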
Regarding the window size for ground truth generation: for each of the four excerpts we used a window length whose main lobe width can reliably resolve adjacent harmonics of the lowest expected F0 (known a priori) for that excerpt.
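One common way to turn this criterion into a number is sketched below. It rests on assumptions not stated in the discussion: a Hamming-type window whose main lobe is 4 bins wide, and the rule that the main lobe width in Hz should not exceed the harmonic spacing (the F0). The helper name `min_window_samples` and the F0 values are made up for illustration and are not the values used for the Indian music data.

```python
import math


def min_window_samples(sample_rate_hz, lowest_f0_hz, main_lobe_bins=4):
    """Smallest window (in samples) whose main lobe is no wider than one F0.

    Main lobe width in Hz is main_lobe_bins * sample_rate / N, so we need
    N >= main_lobe_bins * sample_rate / lowest_f0.
    """
    return math.ceil(main_lobe_bins * sample_rate_hz / lowest_f0_hz)


for f0 in (100.0, 200.0):  # hypothetical lowest F0 values
    n = min_window_samples(22050, f0)
    print(f"F0 >= {f0:.0f} Hz: {n} samples ({1000.0 * n / 22050:.1f} ms)")
```

Under these assumptions, halving the lowest expected F0 doubles the required window length, which is the trade-off between frequency resolution and time resolution discussed in the following comments.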
JL's Comments 05/08/08
Hi Vishu, hi all !
I was wondering if it would be possible to include some of our test set in the development set, so that we know what it is about. Maybe some excerpts of 30s each? Do you think that would be feasible?
I am not sure about what you say about the window length... Could you be more precise? I was lately struggling a little bit with the multiF0 dev set, which led me to notice that the groundtruth sequences for the instruments were not completely aligned... For the sake of comparison, I think we should opt for a given window length that all the participants will use. In ISMIR04, that window length was 46.44 ms, which gave 2048 samples at 44100 Hz. This value seems reasonable to me, even though it might look rather long for our purpose (the pitch can evolve rather fast, even within 50 ms; one can "see" this effect on the spectrogram of a _small_ chirp, where the lobes of the higher peaks - in frequency - are wider than those of the lower peaks). Most of the groundtruth was generated with windows of this size, so I guess it would make more sense if everyone used this size. It might of course not be optimal in some ways, especially if one uses other representations (CQ transform for instance), in which case the participant would be penalized, even if that representation could lead to better results. Anyway! What do you all think about having one window size for all the participants?
A last thing (for today :D): maybe we should convert all the files to the same sampling rate, for the sake of simplicity? Of course one can do it on the fly, with matlab's (bad) resample function. The point, again, is to compare the systems and only the systems: one should get rid of any extra processing that may be needed (like the resampling step). Should we agree on a specific sampling rate for all the songs?
Mert's Comments 05/08/08
Hi everyone, thanks for writing your comments. JL, we appreciate your data set also. The deadline for this task will be August 22nd. Yes, Vishu uploaded the data. It consists of human singing, a background instrument, and a percussive instrument. I'll reinterpolate the ground truth to match the 10 ms hop size and also upsample it to 44100 Hz. I can also recreate the ground truth using yin/wavesurfer/praat to have a 10 ms hop size and a 46 ms window at 44100 Hz if you want.
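Moving a ground truth track onto a regular 10 ms grid could look roughly like the sketch below. It is not the procedure actually used for the task; it assumes nearest-neighbour lookup, chosen here so that a voiced frequency is never blended with the 0 Hz value that marks an unvoiced frame. The function name `resample_nearest` and the sample values are hypothetical.

```python
import bisect


def resample_nearest(times, freqs_hz, new_hop_s):
    """Resample a pitch track onto the grid 0, hop, 2*hop, ... by picking,
    for each grid instant, the value of the nearest original frame.

    `times` must be sorted in increasing order.
    """
    n_out = int(times[-1] / new_hop_s + 1e-9) + 1
    out = []
    for k in range(n_out):
        t = k * new_hop_s
        i = bisect.bisect_left(times, t)   # first original time >= t
        if i == len(times):
            i -= 1
        elif i > 0 and t - times[i - 1] <= times[i] - t:
            i -= 1                         # previous frame is closer (or tied)
        out.append((t, freqs_hz[i]))
    return out


# Track on a 221-sample hop at 22.05 kHz (about 10.023 ms), 0 Hz = unvoiced.
orig_times = [k * 221 / 22050 for k in range(5)]
orig_freqs = [0.0, 200.0, 202.0, 0.0, 0.0]
regridded = resample_nearest(orig_times, orig_freqs, 0.010)
```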
Vishu's Comments 07/08/08
Hi JL and others!
As far as I understand, the window length and alignment of ground truth values are independent. The alignment would depend on the hop size and nothing else.
Regarding the window length, ideally for the ground truth computation the shortest possible window around an analysis time instant should be used in order to be robust to fast pitch modulations. The best option is to have a pitch-adaptive window. I would think that this would make your ground truth all the more 'truthful', especially since the ground truth computation is also making use of some PDAs (YIN, PRAAT, etc.). If this is the case, then I do not think it would be fair to impose a standard window length on all participants, since this might negatively affect their algorithm performance.
For the ground-truth values for our Indian music dataset, we have used shorter windows (23 ms) for female singers and longer (46 ms) windows for male singers. This reduces the effect of the faster (Hz/sec) modulations of the female singers, since they generally have higher pitch.
However, if the ground-truth values themselves are being extracted using some fixed analysis window length (e.g. 46 ms), then I think it would be in the participants' best interests to use the same window length for their analysis.
JL's Comments 11/08/08
"The window length and alignment of ground truth values are independent. The alignment would depend on the hop size and nothing else.":
Once you have the hopsize, I agree that the alignment is straightforward... but only given a certain offset that, in my opinion, depends on the window size - that's really just a matter of aligning the first window. At least, that is relevant to the way we are annotating the groundtruth in MIREX, if I understand correctly.
For your database, does it mean that the time at the center of the first window, for the female sung excerpts, is 11.5ms, while it is 23ms for the male sung ones? I guess we just need to know that so that we can evaluate accordingly.
I would say the difference in window lengths for the male and female excerpts mainly helps to get a better resolution in frequency, the "tracking" ability of the groundtruth being more related to the hopsize you choose. As I understand it, what you mean is that the approximation that the pitch is constant within one analysis window is less inaccurate if the windows are small. I guess we just need a trade-off (a window size of 46 ms seems right to me, but 23 ms ain't bad either!) between this approximation and the precision in estimating the f_0 in the window.
I think we can distinguish 2 types of possible scenarios for the analysis windows: one in which the f_0 is constant, and one in which the f_0 varies. For the first one, I'd say, no problem to annotate. For the second type, I would say the most "human" way of annotating it would be to choose the "mean" of the fundamental frequencies that are present. I may be talking about silly things here, it sounds a little bit stupid, but I was wondering whether other people had been thinking about that... If we wanted to annotate those frames correctly, we should give the instantaneous frequency, with the associated instantaneous time, and also give the slope (say the first order derivative of the instantaneous frequency), which is what people sometimes want to estimate. Giving only one f_0 transforms the problem: check the opera excerpts from the ISMIR04 database, with their deep vibrato. On "transition" frames, the FFT is clearly different from a perfect "spectral comb", as would be the case with a constant f_0. Defining the f_0 for such frames as the maximum of the first lobe (first harmonic) may seem natural, but that is yet another convention.
Another interesting point with those opera excerpts: some of the high frequency components on male performance have variations almost as fast as the ones for the female performances. That means that even if the fundamental frequencies for the male performers do not evolve as fast as for the female performers, their log_2 variations actually are quite close to the latter ones. And since the evaluation criteria are based on the musical scale, that remark has its importance, I think.
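The remark about log_2 variations can be illustrated numerically. The sketch below measures intervals in cents (1200 per octave), which is an assumption about the unit, chosen because it is a standard way to express pitch on the musical scale; the function name `interval_cents` and the frequency values are made up for the example.

```python
import math


def interval_cents(f_from_hz, f_to_hz):
    """Signed interval between two frequencies in cents (1200 per octave)."""
    return 1200.0 * math.log2(f_to_hz / f_from_hz)


# Hypothetical vibrato excursions: 6 Hz around 100 Hz, 24 Hz around 400 Hz.
low_voice = interval_cents(100.0, 106.0)
high_voice = interval_cents(400.0, 424.0)
print(f"{low_voice:.1f} cents vs {high_voice:.1f} cents")
```

The excursion in Hz is four times larger for the higher voice, yet both correspond to the same interval on the musical scale, so an evaluation based on that scale treats them alike.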
But again, for our purpose, I guess the way the annotations have been done is more than sufficient! And new data for evaluation is always welcome! Thank you again for your effort!
Vishu's comments 12/08/08
Hi all! 
I apologise if I was not clear enough before. When I said that "The alignment would depend on the hop size and nothing else.", I assumed that the center of the first analysis window is at 0 sec. This means that irrespective of window length, the window centers would always be at the same time instants i.e. 0, 10ms, 20ms...
On observing the ground truth files for the ISMIR 2004 testing dataset and the MIREX 2005 training dataset, this seems to be the convention they too follow since the time-stamp of the first ground truth value is 0 sec, which should correspond to the center of the first analysis window. This is the convention that we have followed for our Indian music data. Hope this clarifies things.
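The two alignment conventions discussed in this thread differ only in the time stamp of the first window center. A small sketch, with a hypothetical helper `frame_centers` and made-up parameters:

```python
def frame_centers(n_frames, hop_s, first_center_s=0.0):
    """Time stamps (in seconds) of the analysis window centers."""
    return [first_center_s + k * hop_s for k in range(n_frames)]


# Convention described above: first window centered at 0 s,
# so the time stamps do not depend on the window length.
centered_at_zero = frame_centers(3, 0.010)

# Alternative raised earlier in the thread: first window starts at 0 s,
# so its center sits at half the window length (23 ms for a 46 ms window).
starting_at_zero = frame_centers(3, 0.010, first_center_s=0.046 / 2)
```

With the first convention, ground truth computed with 23 ms and 46 ms windows shares the same grid (0, 10 ms, 20 ms, ...); with the second, the two grids would be offset from each other by 11.5 ms.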

