Difference between revisions of "2009:Audio Melody Extraction"

Revision as of 06:08, 5 September 2009

Description

The text of this section is copied from the 2008 page. Please add your comments and discussions for 2009.

The aim of the MIREX audio melody extraction evaluation is to identify the melody pitch contour from polyphonic musical audio. The task consists of two parts: voicing detection (deciding whether a particular time frame contains a "melody pitch" or not), and pitch detection (deciding the most likely melody pitch for each time frame). We structure the submission to allow these parts to be done independently, i.e. it is possible (via a negative pitch value) to guess a pitch even for frames that were judged unvoiced. Algorithms that do not discriminate between melodic and non-melodic parts are also welcome!
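As a rough illustration of the signed-pitch convention described above, the sketch below splits one output value into a voicing decision and a pitch guess. The names `FrameEstimate` and `decode_frame` are hypothetical and are not part of any MIREX tooling.

```python
from typing import NamedTuple


class FrameEstimate(NamedTuple):
    voiced: bool     # the algorithm's voicing decision for this frame
    pitch_hz: float  # pitch guess in Hz (0.0 means no guess was offered)


def decode_frame(value_hz: float) -> FrameEstimate:
    """Split one signed output value into a voicing decision and a pitch guess."""
    if value_hz > 0:
        return FrameEstimate(True, value_hz)
    # Zero or negative: the frame was judged unvoiced. The magnitude of a
    # negative value is still a usable pitch guess.
    return FrameEstimate(False, abs(value_hz))


frames = [220.0, -196.0, 0.0]
decoded = [decode_frame(v) for v in frames]
```

Here the second frame is judged unvoiced but still carries a 196 Hz guess, so voicing and pitch can be assessed separately.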

(The audio melody extraction evaluation will essentially be a re-run of last year's contest, i.e. the same test data is used.)

Discussions for 2009

Your comments here.

New evaluations for 2009?

We would like to know if there would be potential participants for this year's evaluation on Audio Melody Extraction.

There was also interest last year in evaluating the results at the note level (rather than frame by frame), following the multipitch evaluation. However, this was not done, probably because of a lack of both participants and data. Would there be more people this year?

cheers, Jean-Louis, 9th July 2009

Chao-Ling's Comments 14/07/2009

Hi everyone. I would like to suggest that we have a separate evaluation on the songs where the main melody is carried by the human singing voice as opposed to other musical instruments (as in Vishu's comment for MIREX 2008). Our proposed pitch extraction approach is designed for singing voices and is not likely to perform well on other instruments.

In addition, we have prepared a dataset called MIR-1K and would like to add it as part of the training/evaluation dataset. It contains 1000 song clips recorded at 16 kHz sample rate with 16-bit resolution. The duration of each clip ranges from 4 to 13 seconds, and the total length of the dataset is 133 minutes. These clips were extracted from 110 karaoke songs which contain a mixed track and a music accompaniment track.

Vishu's Comments 20/07/2009

Hi Everyone.

Chao-Ling, your dataset sounds exciting. I think the community would benefit greatly from the addition of such a large database. Last year we also contributed some data (Indian classical music). Here are some points about your current dataset to make it conform to previous data formats.

We need to have mixed (voice + accompaniment) mono tracks, so if you would like to mix it yourself (based on some SNR considerations) or if you think just giving equal weight to left and right channels of your files is acceptable, let us know which.
I see that your ground-truth files (.pv files) are in semitones. Since all the previous reference pitch values are in Hz, you could either convert them to Hz and pass them on, or pass on the exact conversion formula (Hz to semitones) you have used to the evaluators (IMIRSEL).
Finally, the previous ground-truth pitch files were in two-column format (TimeStamp_sec PitchValue_Hz), with a value every 10 ms. Your files are single-column. Please let us know the exact time instant at which your first window is centered, so that the conversion can be done accordingly.
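The thread does not state which semitone scale the .pv files use. As a minimal sketch, assuming the values are MIDI-style note numbers (69 = A4 = 440 Hz) and that a value of 0 marks an unvoiced frame, the conversion in both directions could look like this. Both assumptions would need to be confirmed against the dataset documentation.

```python
import math

A4_HZ = 440.0
A4_SEMITONE = 69.0  # assumed reference: MIDI note number of A4


def semitone_to_hz(semitone: float) -> float:
    """Convert an (assumed MIDI-style) semitone value to Hz."""
    if semitone <= 0:
        return 0.0  # assumed marker for an unvoiced frame
    return A4_HZ * 2.0 ** ((semitone - A4_SEMITONE) / 12.0)


def hz_to_semitone(f_hz: float) -> float:
    """Inverse mapping, Hz to semitones, under the same assumed reference."""
    if f_hz <= 0:
        return 0.0
    return A4_SEMITONE + 12.0 * math.log2(f_hz / A4_HZ)
```

Under these assumptions, `semitone_to_hz(81.0)` gives 880 Hz, one octave above A4.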

Thanks again for your efforts.

Chao-Ling's Comments 21/07/2009

Hi Vishu and everybody! Thank you for your suggestions. My responses are as follows:

The left and right channels were adjusted to have equal weight. I prefer to provide both channels because they are good for evaluating algorithms at different SNR settings.
I will provide pitch ground-truth in Hz.
The first window is centered at 20 ms (the window size is 40 ms and the overlap is 20 ms). Note that the last window is discarded if it is shorter than 40 ms, so some .pv files might have one point fewer.
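The framing described in these answers (40 ms window, 20 ms overlap, first window centered at 20 ms) implies a 20 ms hop. A minimal sketch of how time stamps could be attached to a single-column pitch track follows; the function names and the three-decimal output format are illustrative choices, not the format actually used for the released files.

```python
WINDOW_MS = 40
HOP_MS = 20           # window size minus overlap
FIRST_CENTER_MS = 20  # centre of the first window


def frame_times_s(n_frames: int) -> list[float]:
    """Centre time of each frame, in seconds."""
    return [(FIRST_CENTER_MS + k * HOP_MS) / 1000 for k in range(n_frames)]


def n_full_windows(duration_ms: int) -> int:
    """Number of complete 40 ms windows; a short trailing window is dropped."""
    if duration_ms < WINDOW_MS:
        return 0
    return (duration_ms - WINDOW_MS) // HOP_MS + 1


def to_two_column(pitches_hz: list[float]) -> list[str]:
    """Pair each pitch value with its frame-centre time stamp."""
    times = frame_times_s(len(pitches_hz))
    return [f"{t:.3f}\t{p:.3f}" for t, p in zip(times, pitches_hz)]
```

For example, a 100 ms clip yields four frames, centered at 20, 40, 60 and 80 ms, while a 90 ms clip yields only three.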

I will provide new .pv files that contain both the ground truth in Hz and a time stamp column. I will also make the dataset smaller by discarding the parts that are unrelated to this task.

Chao-Ling's Comments 22/07/2009

Okay guys, here are the new files of the dataset: MIR-1K for MIREX. Please feel free to let me know if there are problems.

About MIR-1K - Jean-Louis, 26/07/2009

Hi all!

Such a big database is indeed good news for the relevance of the evaluation. However, in the spirit of MIREX (if any), it might have been good to keep some part of it "hidden" from the participants, so as to perform the evaluation on a test database on which no one could have tuned their algorithms.

Do you, by any chance, have another 1000 songs that we could use for that purpose? :p Well, otherwise, that still makes it a good database for evaluation and comparison.

By the way, I have trouble opening the above-mentioned rar archive file: on my Ubuntu machine, it says the archive type is not supported. Any idea?

One last thing: is anyone interested in evaluating note-wise transcription (as in the multi-f0 evaluation task)? If so, is there any annotation of that type for MIR-1K?

Chao-Ling's Comments 27/07/2009

Hi Jean-Louis and everyone!

Unfortunately, I don't have another dataset. Even if I do, it is not "hidden" from me :S.

The rar file can be extracted on Ubuntu with this program: winrar for Linux. However, MIR-1K does not contain annotations for evaluating note-wise transcription.

Andreas Ehmann's Comments 12/08/2009

Hi guys! We are quietly ramping up for this year's MIREX. The dataset is quite exciting. Although it's not 'hidden' I think it's more than usable. ADC04 isn't quite withheld either. I guess my main concern from a logistics point of view is that it is pretty big! Some of the melody algorithms are on the slow side, so crunching through that many minutes of audio might tie up our machines quite a bit. So I guess our options are either to make sure the algorithms are fast enough, or to choose a subset of the 1000 clips to evaluate against.

Cheers! -Andreas

Jean-Louis' Comments 17/08/2009

Hi everyone, it feels like Andreas's comment on the speed of some algorithms was sort of referring to Pablo's and my program from last year. However, I can't guarantee that this year's algorithms will be any faster...

I guess working on subsets of the database is a good option. Maybe a few hundred snippets from it. Chao-Ling: is the database homogeneous, so that one can pick any excerpts at random and get a representative dataset, or is there a smart way of choosing these excerpts?

Andreas Ehmann's Comments 18/08/2009

Hey gang,

I think I am going to sample-rate convert the MIR-1K dataset to 44.1 kHz (from the 16 kHz it is now). Naturally we will have dead space in the spectrum, but I am already envisioning systems having hard-coded (in samples) frame lengths and hops. Sound reasonable? That, or everyone has to ensure they are robust to multiple sample rates. It's easy to do on my end though, and that way everything will be 44.1 kHz.
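To make the concern about hard-coded frame lengths concrete, here is a small sketch of the arithmetic involved. The frame and hop sizes are hypothetical examples, not values taken from any submitted system.

```python
from fractions import Fraction

SRC_SR = 16_000  # current MIR-1K sample rate
DST_SR = 44_100  # proposed common sample rate

# Rational resampling ratio: upsample by 441, then downsample by 160.
RATIO = Fraction(DST_SR, SRC_SR)

# Hypothetical hard-coded analysis settings, in samples, tuned for 44.1 kHz.
FRAME_LEN = 2048
HOP = 441


def duration_s(n_samples: int, sample_rate: int) -> float:
    """Duration covered by a number of samples at a given sample rate."""
    return n_samples / sample_rate


hop_at_dst = duration_s(HOP, DST_SR)  # intended hop: 10 ms
hop_at_src = duration_s(HOP, SRC_SR)  # same sample count on 16 kHz audio

# After conversion, the band between the old and new Nyquist frequencies
# holds no signal content (the "dead space" mentioned above).
empty_band_hz = (SRC_SR / 2, DST_SR / 2)
```

With these example values, a 441-sample hop that means 10 ms at 44.1 kHz would stretch to about 27.6 ms if applied unchanged to 16 kHz audio, which is why converting the data is the simpler fix.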

Vishu's comments 19/08/2009

Hi all.

Andreas, with respect to our specific entry/entries, they will be sample-rate independent. So 16 kHz or 44.1 kHz doesn't really matter. In the interest of data homogeneity, however, maybe 44.1 kHz is preferable.

I have a question regarding deadlines. Are we following the Sept. 8 deadline, as posted on the MIREX 2009 homepage, or do you think we could push it back a bit? I ask because we intend to submit two algorithms this year and the second one may not be ready by Sept. 8.

Chao-Ling's Comments 02/09/2009

Hi Jean-Louis and all,

Sorry for my late reply. I worked very hard to build a "hidden" dataset for this task. The length of the dataset is around 167 minutes (374 clips of 20~40 secs each). The dataset was built in the same way as MIR-1K with the same format (16 kHz, 16-bit). I would like to know how we will evaluate our algorithms with this dataset and how I should provide it to the committee.

Vishu's comments 03/09/2009

Chao-Ling: Last year we contributed an Indian classical music dataset. I corresponded with Mert Bay (mertbay@gmail.com) at that time.

Chao-Ling's Comments 03/09/2009

Thx Vishu. Andreas Ehmann will set up a dropbox account for me to upload the dataset. Besides, I would like to know what SNR value should be used to mix the singing voice and accompaniment for the evaluation. Any suggestions?

Vishu's comments 03/09/2009

Chao-Ling: From my point of view, it would be useful to divide the dataset into two halves. One with an audibly acceptable SNR (between 5 and 10 dB) and the other with a lower, and therefore tougher, SNR (0 dB). Of course, a lot also depends on the nature of the accompaniment i.e. lower SNR on simple (solo) accompaniment, like a single flute, may not provide as much of a challenge as a relatively higher SNR but with more complex accompaniment, like rock music. Note that I use the terms 'simple' and 'complex' purely from a point of view of signal complexity. Since you are most familiar with the data, you could divide it as you see fit.

Morten's comments 05/09/2009

Chao-Ling and Vishu: I think that it would be preferable if a part of the dataset is mixed both at an "audibly acceptable" SNR and at a "tougher" SNR. It would be difficult to conclude anything if the two mixing levels were used on two different datasets - do the results then depend on the mixing level or on the dataset?
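One way to mix the same clip at several SNRs, as suggested here, is to scale the accompaniment before adding it to the voice. The sketch below assumes SNR means the voice-to-accompaniment power ratio measured over the whole clip; the thread does not fix a definition, and the helper names and toy signals are illustrative only.

```python
import math


def power(x: list[float]) -> float:
    """Mean power of a signal."""
    return sum(s * s for s in x) / len(x)


def accompaniment_gain(voice, accomp, target_snr_db: float) -> float:
    """Gain g such that 10*log10(P_voice / P(g * accomp)) equals the target."""
    return math.sqrt(power(voice) / (power(accomp) * 10 ** (target_snr_db / 10)))


def mix(voice, accomp, target_snr_db: float) -> list[float]:
    """Mono mix of voice plus scaled accompaniment at the requested SNR."""
    g = accompaniment_gain(voice, accomp, target_snr_db)
    return [v + g * a for v, a in zip(voice, accomp)]


def measured_snr_db(voice, accomp) -> float:
    """Voice-to-accompaniment power ratio in dB."""
    return 10 * math.log10(power(voice) / power(accomp))


# Toy signals standing in for the separate voice and accompaniment tracks.
voice = [0.5, -0.5, 0.5, -0.5]
accomp = [0.1, 0.1, -0.1, -0.1]

# The same clip mixed at two levels, so results differ only by mixing level.
mixes = {snr: mix(voice, accomp, snr) for snr in (0.0, 5.0)}
```

Because both mixes come from the same voice and accompaniment, any difference in results can be attributed to the mixing level rather than to the data.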

Dataset

MIREX05 database : 25 phrase excerpts of 10-40 sec from the following genres: Rock, R&B, Pop, Jazz, Solo classical piano
ISMIR04 database : 20 excerpts of about 20s each
CD-quality (PCM, 16-bit, 44100 Hz)
single channel (mono)
manually annotated reference data (10 ms time grid)

Output Format

In order to allow for generalization among potential approaches (i.e. frame size, hop size, etc), submitted algorithms should output pitch estimates, in Hz, at discrete instants in time
so the output file successively contains the time stamp [space or tab] the corresponding frequency value [new line]
the time grid of the reference file is 10 ms, yet the submission may use a different time grid as output (for example 5.8 ms)
Instants identified as unvoiced (there is no dominant melody) can be reported either as 0 Hz or as a negative pitch value. If negative pitch values are given, the statistics for Raw Pitch Accuracy and Raw Chroma Accuracy may be improved.
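The format above can be sketched as follows: one function writes "time stamp, tab, frequency" lines, one reads them back, and one maps an arbitrary output grid onto the 10 ms reference grid. Nearest-neighbour alignment and a reference grid starting at 0 s are assumptions made for illustration; the page does not say how the evaluators align the two grids.

```python
def format_output(times_s, pitches_hz) -> str:
    """One 'time<TAB>frequency' line per analysis instant."""
    return "".join(f"{t:.4f}\t{p:.2f}\n" for t, p in zip(times_s, pitches_hz))


def parse_output(text: str) -> list[tuple[float, float]]:
    """Read back (time, frequency) pairs separated by a space or a tab."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        t, p = line.split()
        rows.append((float(t), float(p)))
    return rows


def to_reference_grid(rows, n_ref: int, grid_s: float = 0.010) -> list[float]:
    """Pick, for each reference instant, the nearest submitted estimate."""
    out = []
    for k in range(n_ref):
        t_ref = k * grid_s
        nearest = min(rows, key=lambda r: abs(r[0] - t_ref))
        out.append(nearest[1])
    return out


# A submission on a 5.8 ms grid; the negative value is a pitch guess for a
# frame judged unvoiced, and 0 means unvoiced with no guess.
times = [k * 0.0058 for k in range(6)]
pitches = [220.0, 221.0, -222.0, 223.0, 0.0, 225.0]

text = format_output(times, pitches)
on_grid = to_reference_grid(parse_output(text), n_ref=3)
```

In this toy example the reference instants at 0, 10 and 20 ms pick up the estimates made at 0, 11.6 and 17.4 ms respectively.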

Relevant Test Collections

For the ISMIR 2004 Audio Description Contest, the Music Technology Group of the Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions, including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI. (full test set with the reference transcriptions (28.6 MB))
Graham's collection: you can find the test set here, and further explanations on the pages http://www.ee.columbia.edu/~graham/mirex_melody/ and http://labrosa.ee.columbia.edu/projects/melody/

Potential Participants

Vishweshwara Rao & Preeti Rao (Indian Institute of Technology Bombay, India)
Jean-Louis Durrieu, Gaël Richard and Bertrand David (Institut Télécom, Télécom ParisTech, CNRS LTCI, Paris, France)
Chao-Ling Leon Hsu, Jyh-Shing Roger Jang, and Liang-Yu Davidson Chen (Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan)
Morten Wendelboe (Institute of Computer Science, Copenhagen University, Denmark)
Sihyun Joo & Seokhwan Jo & Chang D. Yoo (Korea Advanced Institute of Science and Technology, Daejeon, Korea)
Pablo Cancela (pcancela@gmail.com) Montevideo, Uruguay
Hideyuki Tachibana, Takuma Ono, Nobutaka Ono, and Shigeki Sagayama (The University of Tokyo, Japan)

