2007:Multiple Fundamental Frequency Estimation & Tracking

Latest revision as of 13:15, 29 August 2007

Description

That a complex music signal can be represented by the F0 contours of its constituent sources is a very useful concept for most music information retrieval systems. There have been many attempts at multiple (aka polyphonic) F0 estimation and at the related problem of melody extraction. The goal of multiple F0 estimation and tracking is to identify the active F0s in each time frame and to track notes and timbres continuously in a complex music signal. In this task, we would like to evaluate state-of-the-art multiple-F0 estimation and tracking algorithms. Since F0 tracking of all sources in a complex audio mixture can be very hard, we are restricting the problem to three subtasks:

1. Estimate active fundamental frequencies on a frame-by-frame basis.

2. Track note contours on a continuous-time basis (as in audio-to-MIDI transcription). This task will also include a piano transcription subtask.

3. Track timbre on a continuous-time basis. This task has been CANCELED due to lack of participation.

Data

A woodwind quintet transcription of the fifth variation from L. van Beethoven's Variations for String Quartet Op. 18 No. 5. Each part (flute, oboe, clarinet, horn, and bassoon) was recorded separately while the performer listened to the other parts (recorded previously) through headphones. The parts were later mixed to a monaural 44.1 kHz / 16-bit file.

Synthesized pieces using RWC MIDI files and RWC samples, including pieces from the Classical and Jazz collections. Polyphony ranges from 1 to 4 sources.

Polyphonic piano recordings generated using a Disklavier playback piano.

So, there are six 30-second clips for each polyphony level (2-3-4-5), for a total of 30 examples, plus 10 30-second polyphonic piano clips. Please email me your estimated running time (as n times real time); if we believe everybody's algorithm is fast enough, we can increase the number of test samples. (There have been 90x-real-time algorithms in past melody extraction tasks.)

All files are in 44.1 kHz / 16-bit WAV format. The development set can be found at the Development Set for MIREX 2007 MultiF0 Estimation & Tracking Task page (https://www.music-ir.org/evaluation/MIREX/data/2007/multiF0/index.htm).

Send an email to mertbay@uiuc.edu for the username and password.

Evaluation

For Task 1 (frame-level evaluation), systems will report the active pitches every 10 ms. Precision (the proportion of retrieved pitches that are correct, per frame) and Recall (the ratio of correctly retrieved pitches to all ground-truth pitches, per frame) will be reported. A returned pitch is counted as correct if it is within half a semitone (±3%) of a ground-truth pitch for that frame, and only one ground-truth pitch can be associated with each returned pitch. In addition, as suggested, an error score as described in Poliner and Ellis, p. 5 (http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/48317) will be calculated. The frame-level ground truth will be computed with YIN (http://www.ircam.fr/pcm/cheveign/sw/yin.zip) and hand-corrected.
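As a concrete but unofficial illustration of this rule, a frame-level scorer might look like the following Python sketch. The function names and the greedy one-to-one matching strategy are assumptions of the sketch, not part of the task definition; only the ±3% tolerance and the precision/recall definitions come from the description above.

def score_frame(returned, truth, tol=0.03):
    """Score one 10 ms frame: match each returned F0 (Hz) against the
    ground-truth F0s (Hz).  A returned pitch is correct if it lies within
    +/-3% (about half a semitone) of a ground-truth pitch; each
    ground-truth pitch may be matched only once.  Greedy matching is an
    assumption of this sketch.  Returns (num_correct, num_returned, num_truth)."""
    unmatched = list(truth)
    correct = 0
    for f in returned:
        for g in unmatched:
            if abs(f - g) <= tol * g:
                correct += 1
                unmatched.remove(g)
                break
    return correct, len(returned), len(truth)

def precision_recall(frames_returned, frames_truth):
    """Aggregate precision and recall over all frames of one clip."""
    tp = ret = gt = 0
    for r, t in zip(frames_returned, frames_truth):
        c, nr, nt = score_frame(r, t)
        tp, ret, gt = tp + c, ret + nr, gt + nt
    precision = tp / float(ret) if ret else 0.0
    recall = tp / float(gt) if gt else 0.0
    return precision, recall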

For Task 2 (note tracking), Precision (the ratio of correctly transcribed notes to the total number of notes transcribed for that input clip) and Recall (the ratio of correctly transcribed ground-truth notes to the total number of ground-truth notes) will again be reported. A ground-truth note is considered correctly transcribed if the system returns a note that is within half a semitone (±3%) of that note AND the returned note's onset is within a 50 ms range (±25 ms) of the ground-truth note's onset, and its offset is within a 20% range of the ground-truth note's offset. Again, one ground-truth note can only be associated with one transcribed note.
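Again purely as an illustration, a hypothetical note-level matcher under the stated tolerances could be written as below; reading the 20% offset criterion as a fraction of the ground-truth note's duration, and the greedy matching order, are assumptions of this sketch.

def match_notes(returned, truth, pitch_tol=0.03, onset_tol=0.025, offset_frac=0.20):
    """Greedy one-to-one matching of returned notes against ground-truth
    notes.  Each note is an (onset_sec, offset_sec, f0_hz) tuple.  A
    ground-truth note counts as correctly transcribed when some returned
    note is within +/-3% in pitch, within +/-25 ms in onset, and (the
    assumed reading of the 20% rule) within 20% of the ground-truth
    note's duration at the offset."""
    unmatched = list(truth)
    correct = 0
    for onset, offset, f0 in returned:
        for g_onset, g_offset, g_f0 in unmatched:
            duration = g_offset - g_onset
            if (abs(f0 - g_f0) <= pitch_tol * g_f0
                    and abs(onset - g_onset) <= onset_tol
                    and abs(offset - g_offset) <= offset_frac * duration):
                correct += 1
                unmatched.remove((g_onset, g_offset, g_f0))
                break
    precision = correct / float(len(returned)) if returned else 0.0
    recall = correct / float(len(truth)) if truth else 0.0
    return precision, recall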

The ground truth for this task will be annotated by hand. An amplitude threshold relative to the file/instrument will be determined; a note's onset is set to the time at which its amplitude rises above the threshold, and its offset to the time at which its amplitude decays below the threshold. The ground-truth pitch of the note is the average F0 between its onset and offset. In the case of legato, the onset/offset is set to the time at which the F0 deviates by more than 3% from the average F0 of the note up to that point. There will not be any vibrato larger than half a semitone in the test data.
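To make the annotation procedure concrete, here is a rough numpy-based sketch of the thresholding and averaging steps; the function names, the envelope representation, and the handling of the threshold are illustrative assumptions, not the actual annotation tool.

import numpy as np

def note_bounds(envelope, times, threshold):
    """Return (onset, offset) of a single note: the first and last times
    at which its amplitude envelope exceeds the relative threshold."""
    above = np.where(envelope > threshold)[0]
    if above.size == 0:
        return None
    return times[above[0]], times[above[-1]]

def note_pitch(f0_track, times, onset, offset):
    """Average the frame-level F0 estimates (e.g. from YIN) between the
    note's onset and offset to obtain its ground-truth pitch."""
    mask = (times >= onset) & (times <= offset)
    return float(np.mean(f0_track[mask]))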

Different statistics can also be reported if agreed by the participants.

Submission Format

Submissions have to conform to the specified format below:

doMultiF0 "path/to/file.wav"  "path/to/output/file.F0" 

path/to/file.wav: Path to the input audio file.

path/to/output/file.F0: The output file.

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.

The format of the output file differs for each task. For the first task (frame-level F0 estimation), the output is a file in which each row contains a time stamp followed by the active F0s in that frame, all separated by tabs, in 10 ms increments (a minimal writer sketch follows the example below).

Example :

time	F01	F02	F03	
time	F01	F02	F03	F04
time	...	...	...	...

which might look like:

0.78	146.83	220.00	349.23
0.79	349.23	146.83	369.99	220.00	
0.80	...	...	...	...
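A minimal, hypothetical writer for this frame-level format, assuming the system produces one list of F0s per 10 ms frame, could be:

def write_frame_output(path, frame_f0s, hop=0.01):
    """Write one row per 10 ms frame: a time stamp followed by the active
    F0s (Hz) of that frame, all tab-separated.  frame_f0s is a list of
    lists, one (possibly empty) inner list of frequencies per frame."""
    with open(path, "w") as out:
        for i, f0s in enumerate(frame_f0s):
            fields = ["%.2f" % (i * hop)] + ["%.2f" % f for f in f0s]
            out.write("\t".join(fields) + "\n")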

For the second task, each row of the file should contain the onset, offset, and F0 of one note event, separated by tabs, with rows ordered by onset time (again, a small writer sketch follows the example below):

onset	offset F01
onset	offset F02
...	... ...

which might look like:

0.68	1.20	349.23
0.72	1.02	220.00
...	...	...
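A corresponding minimal writer for the note-level format (again only an illustrative sketch) could be:

def write_note_output(path, notes):
    """Write one row per note event: onset, offset and F0, tab-separated,
    ordered by onset time.  notes is a list of (onset_sec, offset_sec,
    f0_hz) tuples."""
    with open(path, "w") as out:
        for onset, offset, f0 in sorted(notes):
            out.write("%.2f\t%.2f\t%.2f\n" % (onset, offset, f0))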

The DEADLINE is Friday August 31.

Poll

Poll question: Would you like the maximum number of concurrent sources given as an input parameter to the systems? (Yes / No)

The maximum level of polyphony will NOT be given as an input to the systems.

Comments

See the Discussion Page for Multiple Fundamental Frequency Estimation & Tracking

Potential Participants

If you might consider participating, please add your name and email address here, and also please sign up for the Multi-F0 mailing list: Multi-F0 Estimation & Tracking email list (https://mail.lis.uiuc.edu/mailman/listinfo/mrx-com03).

  • Koji Egashira (egashira (at) hil_t_u-tokyo_ac_jp)
  • Stanisław Raczyński (raczynski (at) hil_t_u-tokyo_ac_jp)
  • Pierre Leveau (pierre.leveau (at) enst_fr)
  • Valentin Emiya (valentin.emiya (at) enst_fr)
  • Chunghsin Yeh (cyeh (at) ircam_fr)
  • Emmanuel Vincent (emmanuel.vincent (at) irisa_fr) and Nancy Bertin (nancy.bertin (at) enst_fr)
  • Matti Ryynänen and Anssi Klapuri (matti.ryynanen (at) tut_fi, anssi.klapuri (at) tut_fi)
  • Ruohua Zhou (ruouhua.zhou@qmul.ac.uk)
  • Arshia Cont (acont (_at_) ucsd.edu)
  • Chuan Cao and Ming Li ({ccao,mli} (at) hccl.ioa.ac.cn)
  • Graham Poliner and Dan Ellis ({graham,dpwe} (at) ee.columbia.edu)

Moderators

Mert Bay (mertbay@uiuc.edu), Andreas Ehmann (aehmann@uiuc.edu), Anssi Klapuri (klap@cs.tut.fi)