2007:Multiple Fundamental Frequency Estimation & Tracking

From MIREX Wiki
Revision as of 15:24, 16 February 2007 by MertBay (talk | contribs) (chunghsin yeh)

Description

A complex music signal can be represented by the F0 contours of its constituent sources, a representation that is useful in many music information retrieval systems. There have been many attempts at multi-F0 estimation and in the related area of melody extraction. The goal of multiple F0 tracking is to extract the contour of each source from a complex music signal. In this task we would like to evaluate state-of-the-art multi-F0 tracking algorithms. Since F0 tracking of all sources in a complex audio mixture can be very hard, we have to restrict our problem space. The possible cases are:

1. Multiple instruments active at the same time, each playing monophonically (one note at a time) and each having a different timbre, in a single-channel input.

2. Multiple sources, each playing polyphonically (e.g. chords), in a single-channel input.

3. Multiple sources, each playing polyphonically, in a stereo panned mixture.

We are most interested in the first case, which is general but still feasible. The third case, which is a subset of the first case, should be considered a subtask: in most professional recordings, sources are recorded individually and panned across the two stereo channels, and researchers should take advantage of that.

Data

Since extracting the F0 contours of all sources is a challenging task, the number of sources should be limited to 4-5 pitched instruments (no percussion). Annotating the ground-truth data is an important issue. One option is to start with MIDI files and use a realistic synthesizer to create the data, so as to have completely accurate ground truth. A real-world data set could be the RWC database, but that database is already available to participants. Please make your recommendations on creating a database for this task.

Evaluation

The evaluation will be similar to the previous Audio Melody Extraction tasks, based on voicing and F0 detection for each source. Each F0 contour extracted from a song by the proposed system will be scored against the ground-truth contour for that song that yields the highest score. Another score, based on the raw per-frame frequency estimates without tracking, will also be reported.
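The matching described above can be sketched as follows. This is a minimal illustration, not the official scoring code: the 50-cent frame tolerance and the convention that 0 Hz marks an unvoiced frame are assumptions borrowed from common melody-extraction practice.

```python
import math

def frame_score(est, ref, cents_tol=50.0):
    """Fraction of frames on which an estimated contour agrees with a
    reference contour.  0 Hz marks an unvoiced frame; the 50-cent
    tolerance is an illustrative choice, not a fixed MIREX rule."""
    hits = 0
    for e, r in zip(est, ref):
        if r == 0.0:
            hits += e == 0.0            # correctly reported unvoiced
        elif e > 0.0:
            hits += abs(1200.0 * math.log2(e / r)) <= cents_tol
    return hits / len(ref) if ref else 0.0

def best_match_scores(est_contours, ref_contours):
    """Score each extracted contour against the ground-truth contour
    that yields the highest score, as the text describes."""
    return [max(frame_score(est, ref) for ref in ref_contours)
            for est in est_contours]
```

Note that this greedy best-match assignment lets two extracted contours claim the same ground-truth contour; an optimal one-to-one assignment would be a stricter alternative.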

Comments

chunghsin yeh

Reading the above suggestion, we do not understand exactly how the contours are defined. If a contour is like a melody, the problem seems ill-posed. Therefore, we suppose the different contours are related to f0 note contours. The task would then consist of multiple levels of evaluation using different data sets.

1. single frame evaluation

 using either artificially mixed monophonic samples:
 -- mixing with equal/non-equal energy
 -- random mix or musical mix
 or MIDI recordings as suggested above

Note, however, that even with MIDI recordings the ground truth is not perfect, because note-off events will not necessarily align with the end of the instrument's sound, unless you plan to truncate the sound. One may define a tolerance range after the note-off event, where the f0 of the note may or may not be detected by the algorithms. Frames in the tolerance range are not evaluated, as long as any f0 detected there is the correct f0 of the preceding note.
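The tolerance idea above can be made concrete by building per-frame ground truth from note events. This is a sketch under stated assumptions: the `tol_frames` length and the `(f0, required)` frame encoding are illustrative, not part of any agreed evaluation protocol.

```python
def frame_labels(note_events, n_frames, tol_frames=5):
    """Build per-frame ground truth from (onset_frame, offset_frame, f0)
    note events.  Each frame holds (f0, required): required=False marks
    the tolerance region after a note-off, where detecting the previous
    note's f0 is neither rewarded nor penalised."""
    truth = [(0.0, True)] * n_frames     # default: unvoiced, evaluated
    for onset, offset, f0 in note_events:
        for t in range(onset, min(offset, n_frames)):
            truth[t] = (f0, True)
        for t in range(offset, min(offset + tol_frames, n_frames)):
            if truth[t][0] == 0.0:       # do not overwrite a new note
                truth[t] = (f0, False)
    return truth
```

An evaluator would then simply skip frames whose `required` flag is false whenever the detected f0 equals the stored (previous) f0.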

2. multiple frames (tracking) evaluation

  using the MIDI database as above.

We're willing to share our single frame database (artificial mixtures) as well as some scripts for building the reference data.

cyeh(at)ircam.fr


mert bay

Thanks for your comments, Chunghsin. A contour is all the F0s generated by a single instrument. We should make this case feasible by constraining each instrument to play continuously, one note at a time, with each one having a distinct timbre. So participants will not only have to extract all the F0s per frame, but also associate the extracted F0s with the correct timbre.

Since more people are working only on estimation, we can separate the evaluation into two different tasks (tracking and single-frame estimation), so that people can perform F0 estimation on a per-frame basis if they do not want to attempt tracking. No tracking score will be reported for them.
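For the estimation-only task, a natural per-frame score is precision/recall over the set of F0s reported in each frame. The sketch below assumes a 50-cent tolerance and one-to-one matching within a frame; both are illustrative choices, not settled task rules.

```python
import math

def frame_prf(est_f0s, ref_f0s, cents_tol=50.0):
    """Precision and recall for one frame of multiple F0 estimates,
    matching each reference F0 at most once within a cent tolerance
    (the 50-cent tolerance is an assumption, not a MIREX rule)."""
    refs = list(ref_f0s)
    tp = 0
    for e in est_f0s:
        for i, r in enumerate(refs):
            if e > 0 and r > 0 and abs(1200.0 * math.log2(e / r)) <= cents_tol:
                tp += 1
                del refs[i]              # each reference matched once
                break
    prec = tp / len(est_f0s) if est_f0s else 1.0
    rec = tp / len(ref_f0s) if ref_f0s else 1.0
    return prec, rec
```

Averaging these over all frames of a piece gives the kind of no-tracking score mentioned in the Evaluation section.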

To annotate the ground truth from MIDI files, we can synthesize each instrument separately and use a monophonic pitch detector to estimate the F0s, then verify the result manually.
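The per-track pitch-detection step could look like the sketch below. This uses a deliberately simple autocorrelation detector as a stand-in for whatever monophonic tracker (e.g. YIN) would actually be used; the frequency range and frame handling are illustrative assumptions.

```python
import numpy as np

def autocorr_f0(frame, sr, fmin=55.0, fmax=1760.0):
    """Toy autocorrelation pitch detector for one analysis frame of a
    synthesized solo track.  Returns an F0 estimate in Hz, or 0.0 if
    no positive autocorrelation peak is found in the search range."""
    frame = frame - frame.mean()                       # remove DC
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))               # best period lag
    return sr / lag if ac[lag] > 0 else 0.0
```

Running this frame by frame over each synthesized instrument track yields one candidate F0 contour per instrument, which is then checked by hand as described above.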

It is great that you are willing to share data. Do you have monophonic recordings of instruments playing solo passages, or just single notes? If you have solo passages, we can also use them for the tracking evaluation data set by mixing them artificially. The mix might not be musically meaningful; however, it will still yield accurate ground truth.

Moderators

Mert Bay mertbay@uiuc.edu, Andreas Ehmann aehmann@uiuc.edu, Anssi Klapuri klap@cs.tut.fi