2007:Multiple Fundamental Frequency Estimation & Tracking
Latest revision as of 13:15, 29 August 2007

Description

That a complex music signal can be represented by the F0 contours of its constituent sources is a very useful concept for most music information retrieval systems. There have been many attempts at multiple (aka polyphonic) F0 estimation and at melody extraction, a related area. The goal of multiple F0 estimation and tracking is to identify the active F0s in each time frame and to track notes and timbres continuously in a complex music signal. In this task, we would like to evaluate state-of-the-art multiple-F0 estimation and tracking algorithms. Since F0 tracking of all sources in a complex audio mixture can be very hard, we restrict the problem to three cases:

1. Estimate active fundamental frequencies on a frame-by-frame basis.

2. Track note contours on a continuous-time basis (as in audio-to-MIDI transcription). This task also includes a piano transcription subtask.

3. Track timbre on a continuous-time basis. This task has been CANCELED due to lack of participation.

Data

A woodwind quintet transcription of the fifth variation from L. van Beethoven's Variations for String Quartet Op. 18 No. 5. Each part (flute, oboe, clarinet, horn, and bassoon) was recorded separately while the performer listened to the other parts (recorded previously) through headphones. The parts were later mixed to a monaural 44.1 kHz / 16-bit file.

Pieces synthesized using RWC MIDI files and RWC samples, including pieces from the Classical and Jazz collections. Polyphony varies from 1 to 4 sources.

Polyphonic piano recordings generated using a Disklavier playback piano.

There are six 30-second clips for each polyphony level (2, 3, 4, and 5), for a total of 30 examples, plus ten 30-second polyphonic piano clips. Please email me your estimated running time (in terms of n times real time); if we believe everybody's algorithm is fast enough, we can increase the number of test samples. (There were algorithms running at 90 times real time in past melody extraction tasks.)

All files are in 44.1 kHz / 16-bit WAV format. The development set can be found at the Development Set for MIREX 2007 MultiF0 Estimation & Tracking Task page: https://www.music-ir.org/evaluation/MIREX/data/2007/multiF0/index.htm

Send an email to mertbay@uiuc.edu for the username and password.

Evaluation

For Task 1 (frame-level evaluation), systems will report the active pitches every 10 ms. Precision (the ratio of correctly retrieved pitches to all pitches retrieved for each frame) and Recall (the ratio of correctly retrieved pitches to all ground-truth pitches for each frame) will be reported. A returned pitch is assumed to be correct if it is within half a semitone (±3%) of a ground-truth pitch for that frame. Only one ground-truth pitch can be associated with each returned pitch. As suggested, an error score as described in Poliner and Ellis, p. 5 (http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/48317) will also be calculated. The frame-level ground truth will be estimated with YIN (http://www.ircam.fr/pcm/cheveign/sw/yin.zip) and hand-corrected.
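To make the frame-level scoring concrete, here is a minimal sketch in Python. It is not the official scoring code: the greedy one-to-one association and the function names are our own assumptions, and the error score follows our reading of Poliner and Ellis.

def match_count(returned, reference, tol=0.03):
    """Count returned pitches (Hz) lying within half a semitone
    (~ +/-3%) of a distinct reference pitch.  Greedy nearest-match
    association is an assumption; the task only requires that each
    ground-truth pitch be used at most once."""
    unused = list(reference)
    correct = 0
    for f in returned:
        candidates = [r for r in unused if abs(f - r) <= tol * r]
        if candidates:
            unused.remove(min(candidates, key=lambda r: abs(f - r)))
            correct += 1
    return correct

def frame_level_scores(returned_frames, reference_frames):
    """Precision and recall over parallel lists of per-frame pitch
    lists (one list per 10 ms frame), plus a total error score in
    the spirit of Poliner & Ellis (our reading of their Eq. on p. 5:
    substitutions, misses and false alarms over reference pitches)."""
    n_correct = n_returned = n_reference = n_max = 0
    for ret, ref in zip(returned_frames, reference_frames):
        c = match_count(ret, ref)
        n_correct += c
        n_returned += len(ret)
        n_reference += len(ref)
        n_max += max(len(ret), len(ref))
    precision = n_correct / n_returned if n_returned else 0.0
    recall = n_correct / n_reference if n_reference else 0.0
    e_tot = (n_max - n_correct) / n_reference if n_reference else 0.0
    return precision, recall, e_tot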

For Task 2 (note tracking), Precision (the ratio of correctly transcribed notes to the number of notes returned by the system) and Recall (the ratio of correctly transcribed notes to the number of ground-truth notes for that input clip) will again be reported. A ground-truth note is assumed to be correctly transcribed if the system returns a note within half a semitone (±3%) of that note, AND the returned note's onset is within a 50 ms range (±25 ms) of the ground-truth note's onset, and its offset is within a 20% range of the ground-truth note's offset. Again, one ground-truth note can only be associated with one transcribed note.
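As a sketch, one way to encode the note-correctness test (our interpretation: the offset tolerance is taken as 20% of the ground-truth note's duration, since the "20% range" is not fully specified; names are hypothetical):

def note_is_correct(est, ref, f0_tol=0.03, onset_tol=0.025, offset_frac=0.20):
    """est and ref are (onset_s, offset_s, f0_hz) triples.  Returns
    True if est matches ref under the Task 2 criteria; the offset
    test uses 20% of the reference note's duration, which is our
    reading of the spec."""
    e_on, e_off, e_f0 = est
    r_on, r_off, r_f0 = ref
    return (abs(e_f0 - r_f0) <= f0_tol * r_f0       # within half a semitone
            and abs(e_on - r_on) <= onset_tol       # onset within +/-25 ms
            and abs(e_off - r_off) <= offset_frac * (r_off - r_on))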

The ground truth for this task will be annotated by hand. An amplitude threshold relative to the file/instrument will be determined. The note onset will be set to the time where the note's amplitude rises above the threshold, and the offset to the time where the note's amplitude decays below the threshold. The ground-truth pitch will be set to the average F0 between the onset and the offset of the note. In the case of legato, the onset/offset will be set to the time where the F0 deviates by more than 3% from the average F0 throughout the note up to that point. There will not be any vibrato larger than half a semitone in the test data.
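A minimal sketch of the thresholding step, assuming a per-frame amplitude envelope for an isolated note (the 10 ms hop and the function name are our assumptions, not part of the annotation spec):

def note_boundaries(envelope, threshold, hop_s=0.010):
    """envelope: per-frame amplitudes for one note, hop_s seconds
    apart.  Onset = first frame above threshold; offset = end of the
    last frame above threshold."""
    above = [i for i, a in enumerate(envelope) if a > threshold]
    if not above:
        return None  # the note never exceeds the threshold
    onset = above[0] * hop_s
    offset = (above[-1] + 1) * hop_s
    return onset, offset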

Other statistics may also be reported if agreed upon by the participants.

Submission Format

Submissions must conform to the format specified below:

doMultiF0 "path/to/file.wav"  "path/to/output/file.F0" 

path/to/file.wav: Path to the input audio file.

path/to/output/file.F0: The output file.

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
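For illustration, a minimal skeleton of a conforming submission in Python (the estimator is a stub; only the calling convention above and the frame-level output format described below come from the spec):

import sys

def estimate_f0s(wav_path):
    """Stub: a real submission would analyze the audio here and
    return a list of (time_s, [f0_hz, ...]) pairs, one per 10 ms frame."""
    return []

def main():
    if len(sys.argv) != 3:
        sys.exit('usage: doMultiF0 "path/to/file.wav" "path/to/output/file.F0"')
    wav_path, out_path = sys.argv[1], sys.argv[2]
    with open(out_path, "w") as out:
        for t, f0s in estimate_f0s(wav_path):
            out.write("\t".join([f"{t:.2f}"] + [f"{v:.2f}" for v in f0s]) + "\n")

if __name__ == "__main__":
    main()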

The format of the output file differs for each task. For the first task (frame-level F0 estimation), the output is a file in which each row contains a time stamp followed by the F0s active in that frame, separated by tabs, at 10 ms increments.

Example :

time	F01	F02	F03	
time	F01	F02	F03	F04
time	...	...	...	...

which might look like:

0.78	146.83	220.00	349.23
0.79	349.23	146.83	369.99	220.00	
0.80	...	...	...	...

For the second task, each row of the file should contain the onset, offset, and F0 of one note event, separated by tabs, ordered by onset time:

onset	offset	F01
onset	offset	F02
...	...	...

which might look like:

0.68	1.20	349.23
0.72	1.02	220.00
...	...	...
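A minimal sketch of writing and re-reading this note-event format (function names are hypothetical; times in seconds and F0s in Hz, as in the examples above):

def write_notes(path, notes):
    """notes: iterable of (onset_s, offset_s, f0_hz) triples; written
    tab-separated, ordered by onset time."""
    with open(path, "w") as f:
        for onset, offset, f0 in sorted(notes):
            f.write(f"{onset:.2f}\t{offset:.2f}\t{f0:.2f}\n")

def read_notes(path):
    """Parse a note-event file back into (onset, offset, f0) triples."""
    with open(path) as f:
        return [tuple(map(float, line.split())) for line in f if line.strip()]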

The DEADLINE is Friday, August 31.

Poll

Would you like the maximum number of concurrent sources given as an input parameter to the systems? (Yes / No)

The maximum polyphony will NOT be an input to the systems.

Comments

See the Discussion Page for Multiple Fundamental Frequency Estimation & Tracking

Potential Participants

If you are considering participating, please add your name and email address here, and please also sign up for the Multi-F0 mailing list: Multi-F0 Estimation & Tracking email list (https://mail.lis.uiuc.edu/mailman/listinfo/mrx-com03)

  • Koji Egashira (egashira (at) hil_t_u-tokyo_ac_jp)
  • Stanisław Raczyński (raczynski (at) hil_t_u-tokyo_ac_jp)
  • Pierre Leveau (pierre.leveau (at) enst_fr)
  • Valentin Emiya (valentin.emiya (at) enst_fr)
  • Chunghsin Yeh (cyeh (at) ircam_fr)
  • Emmanuel Vincent (emmanuel.vincent (at) irisa_fr) and Nancy Bertin (nancy.bertin (at) enst_fr)
  • Matti Ryynänen and Anssi Klapuri (matti.ryynanen (at) tut_fi, anssi.klapuri (at) tut_fi)
  • Ruohua Zhou (ruouhua.zhou@qmul.ac.uk)
  • Arshia Cont (acont (_at_) ucsd.edu)
  • Chuan Cao and Ming Li ({ccao,mli} (at) hccl.ioa.ac.cn)
  • Graham Poliner and Dan Ellis ({graham,dpwe} (at) ee.columbia.edu)

Moderators

Mert Bay (mertbay@uiuc.edu), Andreas Ehmann (aehmann@uiuc.edu), Anssi Klapuri (klap@cs.tut.fi)