2010:Multiple Fundamental Frequency Estimation & Tracking
Description
The text of this section is copied from the 2009 page. Please add your comments and discussions for 2010.
That a complex music signal can be represented by the F0 contours of its constituent sources is a very useful concept for most music information retrieval systems. There have been many attempts at multiple (aka polyphonic) F0 estimation and melody extraction, a related area. The goal of multiple F0 estimation and tracking is to identify the active F0s in each time frame and to track notes and timbres continuously in a complex music signal. In this task, we would like to evaluate state-of-the-art multiple-F0 estimation and tracking algorithms. Since F0 tracking of all sources in a complex audio mixture can be very hard, we are restricting the problem to 3 cases:
1. Estimate active fundamental frequencies on a frame-by-frame basis.
2. Track note contours on a continuous time basis (as in audio-to-MIDI). This task will also include a piano transcription sub-task.
3. Track timbre on a continuous time basis.
The deadline for this task is September 8th. Please feel free to request an extension if needed.
Discussions for 2010
Discussions from 2009
Zhiyao's Comments 20/07/2009
Hi everyone, I'm glad that our team wants to participate in this task this year! Since our team is new, we have several questions about the evaluation. Thank you in advance for any answers and thoughts.
1. Will the energy threshold that was used to decide if there is a pitch or not in the ground-truth data be provided to the participants? I think this threshold will significantly affect the precision and recall results.
2. In sub-task 2 (note-level), what does the 20% threshold mean for the offset matching? Does that mean that a note is correctly estimated if the estimated offset deviates less than 20% of the note length from the ground-truth offset?
3. For sub-task 2, it seems to me that the 50ms (+- 25ms) threshold for onset matching is quite strict. First, the frame hop we are using is 10ms, which means that only a 2-frame deviation is acceptable. But I think it's hard to make sure that the ground-truth onsets themselves are so precise, am I right? Second, since our method doesn't have an onset detection module, the estimated onset very often deviates more than 25ms from the ground-truth one. In this case, according to the 50ms threshold, this note is wrongly estimated. If we view it globally, however, the note is in the right position compared to other notes and the converted MIDI sounds correct. Does anyone else have this problem? Can we loosen the 50ms threshold, say, to 100ms or something?
4. For sub-task 3 (timbre-level), it seems that it was not active last year. This year we are interested in this sub-task. Is there anyone else who also wants to do this one? To me, sub-task 3 is like this: the polyphonic music for testing consists of several monophonic harmonic sources (harmonic instruments). The sub-task is to estimate and track the pitches for each source. To do this, there can be two levels of evaluation:
1) frame-level: what the system outputs is the same as sub-task 1, except that each pitch output has a source label (instrument 1, instrument 2, etc.), so that the pitches are tracked. Then we can evaluate precision, recall, etc. of these pitches for each instrument (then average maybe). A pitch estimate is considered correct if its time, frequency AND SOURCE LABEL are all correct compared to a ground-truth pitch.
2) note-level: what the system outputs is the same as sub-task 2, except that each note output has a source label (instrument 1, instrument 2, etc.), so that the notes are tracked. Then we can evaluate precision, recall, overlap ratio, etc. of these notes for each instrument (then average maybe). A note estimate is considered correct if its frequency, onset, offset AND SOURCE LABEL are all correct compared to a ground-truth note.
I think it's reasonable to evaluate the timbre-level sub-task in the above two ways, because some methods may first form notes and then track the source, while others may first track the source and then form notes. Any comments on this evaluation?
Best, Zhiyao
Pantaleo's Comments 27/07/2009 (University of Florence, Italy)
Hi everybody from our team,
About Zhiyao's comments: we agree with point 3) that a threshold for onset matching of 50 ms (+/- 25 ms) is quite strict, so we are in favor of setting it to a larger value, for example the suggested 100 ms. We don't know if we are too late for this change. Any other comments?
We also agree with point 1); we think it would be suitable for all participants to know the energy threshold (or other threshold values/parameters that could affect estimation precision) used to build the ground-truth data.
Finally, we are not interested in participating in sub-task 3 (Timbre Evaluation).
Best regards, Gianni Pantaleo
Nuno Fonseca's Comments 10/08/2009 (Polytechnic Institute of Leiria, Portugal)
Hi everyone...
I am not participating in the task with an algorithm, but I have a suggestion regarding metrics and the way results are obtained. Although the MIREX metric is effective and, most of all, simple, there are factors that are not taken into consideration. I'm presenting a paper on a new metric at ESCOM 2009 (12-16 August) that, according to my tests, shows a better correlation with human perception tests. Instead of considering only "CORRECT" or "INCORRECT" transcribed notes, it goes much deeper. In brief, it considers 2 methods that complement each other: an event approach (considering pitch and onsets), almost as if the instruments had a pure decay behavior; and a time approach (overlapping their "piano-rolls"), almost as if the instruments had a pure sustain behavior. For instance, the decay method considers many aspects, including a note score between 0 and 100% that measures how well the note is transcribed. The full paper is available at https://jyx.jyu.fi/dspace/handle/123456789/20865?show=full
I also have a C/C++ or MATLAB implementation of it for those who want to give it a try...
Keep up with the good work, Nuno Fonseca
Mert Bay's Comments 10/08/2009
Hi everyone, thanks for your comments and discussion.
>1. Will the energy threshold that was used to decide if there is a pitch or not in the ground-truth data be provided to the participants? I think this threshold will affect the precision and recall results much.
The energy thresholds used for the pitch estimation depend on the instrument and the program used to extract the ground truth of the monophonic instrument track. AFAIK, it was around 0.35 for PRAAT and 0.1 for YIN. But please be aware that those thresholds have different meanings in each program. For the note tracking part, the onsets and offsets were labeled manually considering the pitch contour, spectrogram and amplitude envelope. I believe one of the reasons for the performance decrease for the note tracking sub-task when evaluated against onset-offset is that the offsets were labeled at very low loudness levels. Although the 20% tolerance might account for it, when the solo track is mixed with other instruments, it is easily dominated.
>2. In sub-task 2 (note-level), what does the 20% threshold mean for the offset matching? Does that mean that a note is correctly estimated if the estimated offset deviates less than 20% of the note length from the ground-truth offset?
Yes, exactly. Let's say you have a note with a duration of 1 sec and an offset at the 2nd sec. The estimated offset would be correct if it falls between 1.8 and 2.2 sec. If the note's duration is shorter than 250 ms, then a threshold of +-50ms is used.
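The offset rule described in this answer can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual MIREX evaluation code; the function name `offset_window` and its parameters are hypothetical. Since 20% of 250 ms is 50 ms, the two cases are written here as a single maximum.

```python
def offset_window(onset, offset, ratio=0.2, min_tol=0.05):
    """Return the (earliest, latest) accepted estimated offset, in seconds.

    Assumption: the tolerance is 20% of the ground-truth note duration,
    but never less than 50 ms (the rule quoted for notes under 250 ms).
    """
    duration = offset - onset
    tol = max(ratio * duration, min_tol)
    return offset - tol, offset + tol


# A 1-second note ending at 2.0 s, as in the example above.
long_lo, long_hi = offset_window(1.0, 2.0)

# A 100 ms note: the 50 ms floor applies instead of 20 ms.
short_lo, short_hi = offset_window(0.0, 0.1)
```

With these inputs the first window is roughly 1.8 to 2.2 sec, and the second roughly 0.05 to 0.15 sec.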
>3. For sub-task 2, It seems to me that the 50ms (+- 25ms) threshold for onset matching is quite strict.
Sorry, it is a typo copy-pasted from a previous definition. We decided to go with +-50ms around the ground-truth onset, and this is how it was evaluated for the last 2 years.
>4. For sub-task 3 (timbre-level), it seems that it's not active last year. This year we are interested in this sub-task.
Thanks for bringing this up. I believe there will be enough participants this year. We really would like to run it, which would turn this task into a music transcription task. For timbre tracking, the I/O format can be exactly the same as for F0 estimation, but this time each column represents one source. For note tracking, we add a third column, which is a number that labels the instrument source.
Thanks for the paper, Nuno. Yes, the MIREX metrics are simple and are not supposed to evaluate the error in a perceptual way. We always encourage the community to use the MIREX results data for different types of evaluations.
Data
A woodwind quintet transcription of the fifth variation from L. van Beethoven's Variations for String Quartet Op.18 No. 5. Each part (flute, oboe, clarinet, horn, or bassoon) was recorded separately while the performer listened to the other parts (recorded previously) through headphones. Later the parts were mixed to a monaural 44.1kHz/16-bit file.
Synthesized pieces using RWC MIDI and RWC samples. Includes pieces from Classical and Jazz collections. Polyphony changes from 1 to 4 sources.
Polyphonic piano recordings generated using a disklavier playback piano.
So, there are 6, 30-sec clips for each polyphony (2-3-4-5) for a total of 30 examples, plus there are 10 30-sec polyphonic piano clips. Please email me about your estimated running time (in terms of n times realtime). If we believe everybody's algorithm is fast enough, we can increase the number of test samples. (There were 90 x real-time algorithms for melody extraction tasks in the past.)
All files are in 44.1kHz / 16 bit wave format. The development set can be found at Development Set for MIREX 2007 MultiF0 Estimation Tracking Task.
Send an email to mertbay@uiuc.edu for the username and password.
Evaluation
This year, we would like to discuss different evaluation methods. From last year's results, it can be seen that on note tracking, algorithms performed poorly when evaluated using note offsets. Below are the evaluation methods we used last year:
For Task 1 (frame level evaluation), systems will report the active pitches every 10ms. Precision (the portion of correct retrieved pitches for all pitches retrieved for each frame) and Recall (the ratio of correct pitches to all ground truth pitches for each frame) will be reported. A Returned Pitch is assumed to be correct if it is within a half semitone (±3%) of a ground-truth pitch for that frame. Only one ground-truth pitch can be associated with each Returned Pitch. Also as suggested, an error score as described in Poliner and Ellis, p. 5, will be calculated. The frame level ground truth will be calculated by YIN and hand corrected.
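As a rough illustration of the frame-level criterion, the Python sketch below counts correct pitches with a one-to-one association and derives precision and recall. It is not the MIREX evaluation code. The function names are hypothetical, the half-semitone test is done on a log-frequency scale, the matching is a simple greedy nearest-pitch assignment, and counts are pooled over all frames before dividing; all of these are assumptions. Frequencies must be positive values in Hz.

```python
import math


def count_correct(est, ref, tol_semitones=0.5):
    """Greedily pair each estimated pitch with the closest unused reference
    pitch that lies within the tolerance; return the number of pairs."""
    unused = list(ref)
    correct = 0
    for f in est:
        best, best_dist = None, tol_semitones
        for i, g in enumerate(unused):
            dist = abs(12.0 * math.log2(f / g))  # distance in semitones
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            unused.pop(best)  # each reference pitch is used at most once
            correct += 1
    return correct


def frame_precision_recall(est_frames, ref_frames):
    """est_frames and ref_frames are lists of per-frame lists of F0s in Hz."""
    n_correct = n_est = n_ref = 0
    for est, ref in zip(est_frames, ref_frames):
        n_correct += count_correct(est, ref)
        n_est += len(est)
        n_ref += len(ref)
    precision = n_correct / n_est if n_est else 0.0
    recall = n_correct / n_ref if n_ref else 0.0
    return precision, recall


est_frames = [[146.83, 220.0, 350.0], [220.0, 500.0]]
ref_frames = [[146.83, 220.0, 349.23], [220.0, 349.23, 369.99]]
precision, recall = frame_precision_recall(est_frames, ref_frames)
```

In this toy input, four of the five returned pitches match one of the six ground-truth pitches, so precision is 4/5 and recall is 4/6.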
For Task 2 (note tracking), again Precision (the ratio of correctly transcribed ground truth notes to the number of transcribed notes) and Recall (the ratio of correctly transcribed ground truth notes to the number of ground truth notes for that input clip) will be reported. A ground truth note is assumed to be correctly transcribed if the system returns a note that is within a half semitone (±3%) of that note AND the returned note's onset is within a 100ms range (±50ms) of the onset of the ground truth note, and its offset is within a 20% range of the ground truth note's offset, where the 20% is taken relative to the ground truth note's duration. Again, one ground truth note can only be associated with one transcribed note.
The ground truth for this task will be annotated by hand. An amplitude threshold relative to the file/instrument will be determined. Note onset is going to be set to the time where its amplitude rises higher than the threshold, and the offset is going to be set to the time where the note's amplitude decays lower than the threshold. The ground truth pitch is going to be set as the average F0 between the onset and the offset of the note. In the case of legato, the onset/offset is going to be set to the time where the F0 deviates more than 3% from the average F0 throughout the note up to that point. There is not going to be any vibrato larger than a half semitone in the test data.
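The note-level criterion can be sketched in Python as follows. This is a minimal illustration, not the official scoring code: the function name `match_notes`, the tuple layout `(onset, offset, f0)`, and the first-fit one-to-one assignment are all assumptions. It uses the conventional definitions, with precision computed over the returned notes and recall over the ground-truth notes, and the offset tolerance follows the clarification from the 2009 discussion (20% of the ground-truth duration, with a 50 ms floor).

```python
import math


def match_notes(est_notes, ref_notes, onset_tol=0.05, use_offset=False):
    """Notes are (onset_sec, offset_sec, f0_hz) tuples.

    Returns (precision, recall). Each estimated note can be matched to at
    most one ground-truth note (first fit, an assumption of this sketch).
    """
    used = set()
    correct = 0
    for r_on, r_off, r_f0 in ref_notes:
        for j, (e_on, e_off, e_f0) in enumerate(est_notes):
            if j in used:
                continue
            # Pitch must be within half a semitone of the reference.
            if abs(12.0 * math.log2(e_f0 / r_f0)) > 0.5:
                continue
            # Onset must be within +-50 ms of the reference onset.
            if abs(e_on - r_on) > onset_tol:
                continue
            if use_offset:
                off_tol = max(0.2 * (r_off - r_on), 0.05)
                if abs(e_off - r_off) > off_tol:
                    continue
            used.add(j)
            correct += 1
            break
    precision = correct / len(est_notes) if est_notes else 0.0
    recall = correct / len(ref_notes) if ref_notes else 0.0
    return precision, recall


ref = [(0.68, 1.20, 349.23), (0.72, 1.02, 220.00)]
est = [(0.70, 1.60, 349.00), (0.90, 1.00, 220.00), (1.50, 1.80, 440.00)]
onset_only = match_notes(est, ref)
onset_offset = match_notes(est, ref, use_offset=True)
```

Here only the first returned note passes the onset-only test, and it fails once the offset is also checked because it ends 0.4 sec late, which illustrates why scores drop when offsets are included.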
Different statistics can also be reported if agreed by the participants.
Submission Format
Submissions have to conform to the specified format below:
doMultiF0 "path/to/file.wav" "path/to/output/file.F0"
path/to/file.wav: Path to the input audio file.
path/to/output/file.F0: The output file.
Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
The format of the output file differs for each task. For the first task, F0 estimation on a frame basis, the output will be a file where each row has a time stamp followed by the active F0s in that frame, separated by tabs, in 10ms increments.
Example :
time F01 F02 F03
time F01 F02 F03 F04
time ... ... ... ...
which might look like:
0.78 146.83 220.00 349.23
0.79 349.23 146.83 369.99 220.00
0.80 ... ... ... ...
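A minimal Python sketch of producing rows in this layout is given below. The helper names `format_frame_rows` and `write_frame_file` are hypothetical, and several details are assumptions rather than part of the specification: time stamps start at 0.00 for the first frame, values are printed with two decimals as in the example, and a frame with no active F0 is written as a time stamp alone.

```python
def format_frame_rows(frames, hop=0.01):
    """Turn a list of per-frame F0 lists (Hz) into tab-separated rows.

    Assumption: frame i is stamped i * hop seconds, with hop = 10 ms.
    """
    rows = []
    for i, f0s in enumerate(frames):
        fields = ["%.2f" % (i * hop)] + ["%.2f" % f0 for f0 in f0s]
        rows.append("\t".join(fields))
    return rows


def write_frame_file(path, frames, hop=0.01):
    """Write one row per frame to the output file given on the command line."""
    with open(path, "w") as fh:
        for row in format_frame_rows(frames, hop):
            fh.write(row + "\n")


rows = format_frame_rows([
    [146.83, 220.00, 349.23],
    [349.23, 146.83, 369.99, 220.00],
    [],  # silent frame: time stamp only (an assumption)
])
```

Keeping the formatting separate from the file writing makes it easy to check the rows before anything is written to disk.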
For the second task, for each row, the file should contain the onset, offset and the F0 of each note event separated by a tab, ordered in terms of onset times:
onset offset F01
onset offset F02
... ... ...
which might look like:
0.68 1.20 349.23
0.72 1.02 220.00
... ... ...
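Before submitting, it may help to check that a note-tracking output file follows this layout. The Python sketch below parses the rows and verifies the ordering; the function name `parse_note_rows` is hypothetical, and the checks it performs (exactly three tab-separated numbers per row, offset after onset, rows sorted by onset) are a reading of the format description rather than an official validator.

```python
def parse_note_rows(text):
    """Parse 'onset<TAB>offset<TAB>F0' rows into (onset, offset, f0) tuples.

    Raises ValueError if a row is malformed or the rows are not ordered
    by onset time.
    """
    notes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (an assumption)
        # Unpacking fails with ValueError unless there are exactly 3 fields.
        onset, offset, f0 = (float(x) for x in line.split("\t"))
        if offset <= onset:
            raise ValueError("offset must come after onset: %r" % line)
        notes.append((onset, offset, f0))
    if any(a[0] > b[0] for a, b in zip(notes, notes[1:])):
        raise ValueError("notes must be ordered by onset time")
    return notes


notes = parse_note_rows("0.68\t1.20\t349.23\n0.72\t1.02\t220.00\n")
```

The same parsed tuples can then be passed to whatever scoring routine is used for local testing on the development set.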
The DEADLINE is TBA.
Potential Participants
If you might consider participating, please add your name and email address here and also please sign up for the Multi-F0 mail list: Multi-F0 Estimation Tracking email list