2007:Discussion Page for Multiple Fundamental Frequency Estimation & Tracking


Evaluation Suggestions

Comments

chunghsin yeh

Reading the above suggestion, we don't understand exactly how the contours are defined. If a contour is like a melody, the problem seems ill-posed. Therefore, we suppose the different contours correspond to f0 note contours. The task would then consist of multiple levels of evaluation using different data sets.

1. single frame evaluation

 using either artificially mixed monophonic samples:
 -- mixing with equal/non-equal energy
 -- random mix or musical mix
 or MIDI recordings as suggested above

Note, however, that even with MIDI recordings the ground truth is not perfect, because note-off events will not necessarily align with the end of the instrument's sound, unless you plan to truncate the sound at the note-off. One may define a tolerance range after the note-off event, where the f0 of the note may or may not be detected by the algorithms. The tolerance regions are not evaluated as long as the f0 detected in this region is the correct f0 of the previous note.
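To make this concrete, here is a minimal sketch (in Python, with an assumed frame layout, a hypothetical 100 ms tolerance, and a quarter-tone pitch check; none of these values are part of the proposal above) of how such a post-note-off tolerance region could be excluded from scoring:

 import math
 def ignore_frame(frame_time, detected_f0, note_off_time, prev_note_f0,
                  tol_sec=0.1, cents_thresh=50.0):
     """Return True if this frame falls in the tolerance region after a
     note-off and the detected f0 is still the f0 of the previous note
     (or silence), so it should not be counted against the algorithm."""
     if not (note_off_time <= frame_time <= note_off_time + tol_sec):
         return False                      # outside the tolerance region
     if detected_f0 <= 0:
         return True                       # silence after note-off is acceptable
     cents = abs(1200.0 * math.log2(detected_f0 / prev_note_f0))
     return cents <= cents_thresh          # still tracking the decaying note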

2. multiple frames (tracking) evaluation

  using the MIDI database as above.

We're willing to share our single frame database (artificial mixtures) as well as some scripts for building the reference data.

mert bay

Thanks for your comments, Chunghsin. A contour is all the F0s generated by a single instrument. We should make this case feasible by constraining each instrument to play continuously, one note at a time, with each instrument having a distinct timbre. So the participants will not only have to extract all the F0s per frame, but also associate the extracted F0s with the correct timbre.

Since more people are working only on estimation, we can clearly separate the evaluations (tracking vs. single-frame estimation) into two different tasks, so that people can perform F0 estimation on a per-frame basis only if they don't want to attempt tracking. No tracking score will be reported for them.

To annotate the ground truth from MIDI files, we can synthesize each instrument separately, use a monophonic pitch detector to estimate the F0s, and then manually verify the result.
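As a rough illustration of that pipeline with present-day tools (pretty_midi for per-track synthesis and librosa's pYIN as the monophonic pitch detector; these library choices and the 10 ms hop are my assumptions rather than part of the proposal), the output would of course still need the manual verification step:

 import numpy as np
 import pretty_midi
 import librosa
 def reference_f0s(midi_path, sr=44100, hop=441):
     """Synthesize each MIDI instrument alone, run a monophonic pitch
     detector on it, and return {instrument: (frame_times, f0_hz)}."""
     midi = pretty_midi.PrettyMIDI(midi_path)
     refs = {}
     for inst in midi.instruments:
         audio = inst.synthesize(fs=sr)          # this track alone
         f0, voiced, _ = librosa.pyin(audio, fmin=55.0, fmax=2000.0,
                                      sr=sr, hop_length=hop)
         times = librosa.times_like(f0, sr=sr, hop_length=hop)
         refs[inst.name or str(inst.program)] = (times, np.where(voiced, f0, 0.0))
     return refs  # then manually verify/correct the contours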

It is great that you are willing to share data. Do you have monophonic recordings of instruments playing solo passages, or just single notes? If you have solo passages, we can also use them for the tracking evaluation dataset. We can mix them artificially; the mix might not be musically meaningful, but it will give us accurate ground truth.


chunghsin yeh

F0 contour detection as you have defined can only be done for case 1: "Multiple instruments active at the same time but each playing monophonically ..."

To our understanding, there are two levels of F0 tracking:

1. tracking note contours without taking into account the instrument timbre (possible for case 1-3)

2. tracking note contours with similar instrument timbre (for case 1 only)

Therefore, our suggestion would be to have three levels of evaluation: single frame, note contour and instrument contour.

We have artificial mixtures of single notes built from the McGill, Iowa, IRCAM and RWC databases. We also have a collection of solo instrument recordings, but we would like to look into the rights issues.

mert bay

Yes, the instrument tracking can only be done feasibly for case 1. We can have these 3 levels of evaluation as you said, with different datasets, and participants can submit their systems to whichever is suitable. I guess we won't have a problem creating a dataset from artificial mixtures of single notes. We are planning to organize a recording session to record each instrument of a quintet separately, but at the end we'll only have a limited amount of data. We would appreciate it if you could share your collection of instrument recordings. That data will not be made public; it will only be used for evaluation purposes.

Eric Nichols

I am interested in participating, although I hope that this task ends up using real audio recordings for the competition this year. I am not particularly interested in the results of using artificially synthesized audio -- even though it may ease the creation of ground truth, it seems to invite different sorts of solutions than one would develop for the general case of polyphonic audio recognition.

I suggest that this task uses real recordings that have been annotated. This annotation can be accomplished by performing an automated match between a score/MIDI file and audio recording, and then the automated match can be hand-tuned to ensure reliable ground truth. I agree that the task should deal with monophonic audio.

I don't see why the distinction should be made between monophonic and polyphonic instruments -- a system to track multiple f0s should handle two flutes in the same manner as a funny "polyphonic flute", just as it should handle a monophonic line on a piano as well as chords on a piano. I suppose this is because I am biased by an interest in reducing audio to "piano reduction"-type scores, and I'm not as interested in detecting the particular instruments playing each note.

I would further suggest starting with rather simple instrumental pieces, to facilitate creation of the annotation and to simplify the problem domain, which is very complex. For example, small chamber pieces such as duos, trios, quartets, etc., simple piano pieces, and perhaps even simple examples of full orchestral music could be included.


Mert Bay

I would like to announce that we are now using an acoustic audio recording of a professional woodwind quintet (flute, bassoon, clarinet, horn, oboe) playing a 9-minute piece by Beethoven, recorded close-miked. Each performer is recorded on a separate track. We can get the ground truth using a monophonic pitch detector and then hand-correct the contours. Combinations of 2, 3, 4, or 5 out of the 5 tracks of approx. 30-second incipits can be used as our dataset. I'll upload a portion of the data as a development set once the recordings are completed.


The polyphonic instruments will cause problems with most trackers, although there is no problem for F0 estimation (evaluated on a frame-by-frame basis). Do you suggest that the systems should output the F0s of a polyphonic instrument in one track? We can create another category for that if more than 3 people want their system to be evaluated in that category.

The reason for evaluation based on note contours is that some systems which track according to the continuity of the spectrum tend to lock onto a note of a particular instrument without taking the timbre into account; when that note ends and its energy gets weaker, the system will start to track another note of another instrument. If your system can perform tracking on the instrument level, fine; then it shouldn't be a problem for you to segment the note boundaries. Of course, you don't have to be evaluated on that if you don't want to.

As discussed above, for right now three levels of evaluation are being considered:

1. Evaluate the F0 estimators on a frame-by-frame basis. Systems should output the active frequencies in each frame. This dataset can have polyphonic instruments.

2. Evaluate F0 trackers on the note level. Each system should output an F0 contour for each note from each instrument. This data can also have polyphonic instruments.

3. Evaluate F0 trackers on the instrument level. Each system should output the F0 contour of each instrument.

Eric Nichols

Thanks for the clarification. I still have a few more questions. I'm mostly interested in version 1 of the task above, if I understand things correctly. Our algorithm outputs a set of sounding notes at each frame, without tracking individual instruments. Please correct me if this is wrong, but I think that this means that task 1 is the correct one for this system.

I also would like to know what is meant by frequency tracking -- our system outputs symbolic sets of sounding notes (i.e. MIDI numbers), not actual frequency in Hz. Is this task concerned with tracking the fine-tuning of each note over time?

Finally, in the proposed task, are the instruments known ahead of time, for use in training/tuning algorithms for those particular sounds? For instance, should the participating systems be tuned for flute, bassoon, clarinet, horn, and oboe?

Mert Bay

Since your algorithm performs only F0 estimation, not tracking, the first evaluation criterion is suitable for you. For the first criterion, an estimated F0 will be evaluated as correct if it is within a semitone range (plus or minus a quarter tone) of a ground truth F0 for that frame. Every reported F0 will be checked against the closest F0 in the ground truth. The algorithms will report a set of F0s every 10 ms. I suggest that the systems should know the total number of sources in the piece beforehand; however, each source will not be active in every frame. For the second criterion, we have to decide on a reasonable tolerance for onset/offset timings of the reported notes relative to the ground truth timings. The third criterion will be evaluated like the first one, except that this time each reported F0 track for the whole piece will be evaluated against the closest ground truth F0 track. If the instrument is not active, the system should report an F0 of 0 Hz, or MIDI note number 0.
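To make the frame-level rule concrete, here is a small sketch (Python; the list-based layout of one 10 ms frame is just for illustration) of counting reported F0s that fall within a quarter tone of the closest ground-truth F0:

 import math
 def frame_hits(reported_f0s, truth_f0s, cents_thresh=50.0):
     """Count reported F0s in one frame that lie within a quarter tone
     (50 cents) of the closest ground-truth F0 in that frame."""
     refs = [r for r in truth_f0s if r > 0]
     hits = 0
     for est in reported_f0s:
         if est <= 0 or not refs:
             continue
         closest = min(abs(1200.0 * math.log2(est / r)) for r in refs)
         if closest <= cents_thresh:
             hits += 1
     return hits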

For the 3rd evaluation criterion, the systems will have the opportunity to be tuned to the instruments, which is not necessary for F0 estimation. That's why we will soon release a small training set of individual tracks of a woodwind quintet playing a highly contrapuntal piece.

graham poliner

Thanks for organizing the multiple f0 evaluation Mert. We certainly appreciate how much work it takes to develop such a test set. We will likely participate in the first two subtasks (our algorithm estimates discrete notes without regard for instrument type), but we have a few questions:

Regarding the frame-level evaluation metric, is it essential to provide the maximum number of sources (f0's in the above context) as an input to the algorithms? It may simplify the problem beyond what is necessary. If I understand the proposed evaluation metric correctly, the definition of TN is a function of the total number of potential voices; however, this may lead to biases in the algorithms based on the specific music in the test set. For example, in the case where an instrument/note is almost always 'off', the submitted algorithms would be rewarded for underestimating the number of frames in which the f0 is voiced, and vice versa. The proposed metric has the benefit that it is bounded by 0 and 1, but there are alternative metrics that don't require knowledge of the total number of potential voiced f0s. We would be in favor of reporting error metrics, rather than pseudo-accuracy, as defined in the NIST rich transcription meeting and Poliner and Ellis. (The numbers may be generally lower, but they will likely be more informative.) We could also report a frame-level version of the accuracy metric proposed by Dixon, which allows us to report a metric bounded by 0 and 1 without knowledge of the total number of possible voices.
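For comparison, a sketch of the two frame-level measures under discussion; the TN definition below is only one plausible reading of "a function of the total number of potential voices", and the Dixon-style accuracy is the TP/(TP+FP+FN) form that needs no voice count:

 def accuracy_with_tn(tp, fp, fn, n_frames, max_voices):
     """Accuracy including true negatives; depends on knowing the total
     number of potential voices (the bias concern raised above)."""
     slots = n_frames * max_voices
     tn = slots - (tp + fp + fn)
     return (tp + tn) / slots
 def accuracy_dixon_style(tp, fp, fn):
     """Frame-level accuracy in the style of Dixon: bounded by 0 and 1,
     no knowledge of the number of potential voices required."""
     total = tp + fp + fn
     return tp / total if total else 1.0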

For the note-level evaluation metric, perhaps we could more formally state the definition of note onset/offset rather than leaving it to be "manually determined" by a musician. We could use some fraction of the relative energy of the note and/or the formal definition developed for the onset detection task. In addition, the offset is of much less perceptual importance, and for instruments like the piano, a great deal of energy persists long after the note is over according to the musical score when the pedals are used. As such, we would be in favor of a formal definition and tight tolerance (e.g. 50 ms) for the onset, but with a more relaxed tolerance for the offset -- something like the metric proposed by Ryynanen and Klapuri, in which a note is marked correctly transcribed when the onset is within a fixed tolerance window and the offset is within a (relative) fraction of the length of the note. How will issues such as vibrato and glissando be treated for the note-level evaluation?

Finally, we have a test set of polyphonic piano recordings (44kHz, mono) with aligned MIDI ground truth that we would be glad to contribute to the evaluation. We'd also be willing to help annotate in order to avoid using artificial mixtures or synthesized recordings in the evaluation.

mert bay

Thanks for your comments, Graham. Sorry for the late reply; I have been traveling recently and do not have access to the internet very often, but things will go back to normal on July 23rd. It is true that in the above metric true negatives are a function of the maximum number of voices. Underestimating the voiced f0s will increase the number of TNs; however, it will also increase the number of FNs. We are open to different evaluation metrics; we (IMIRSEL and the participants) have to discuss and come to an agreement soon. I'll check out the references you've sent. For the note-level evaluation, having a fixed tolerance for the onset and a relative tolerance for the offset makes sense. For the ground truth, we can set a relative amplitude threshold for each note's onset and offset. We will only include audio with less than one semitone of vibrato and no glissando in the test set for note-level evaluation. We would really appreciate it if you could share some data from your polyphonic piano recordings.

Pierre Leveau (and participants from ENST)

Mert, thank you for organizing this contest. It is indeed quite difficult to get all the evaluations of the multi-f0 estimation "harmonized". Sorry for the late contribution to the discussion, we hope it is not too late. After some discussions with potential ENST participants, we would like to have some information about the following points:

- Will the instruments finally be restricted to the subset you mentioned (i.e. clarinet, oboe, flute, horn, bassoon)? In this case, one could train instrument-specific models to perform the tasks. The piano has been mentioned in the comments. In our opinion, the 3rd task cannot be performed in a straightforward manner if the piano is involved, even alone in the musical piece. Finally, why not create 5 "subsubtasks" (I-mono instru., II-mono instru., III-mono instru., I-piano, II-piano)? Many people have worked only on the piano, others only on monophonic instruments, and some of them may hesitate to submit algorithms for multi-f0 detection in too open a context. I guess it would demand some extra work and discussion; it is only a suggestion.

- We wonder about the evaluation of the second task. Depending on the evaluation protocol, the algorithms will not be tuned in the same manner. We find the symbolic evaluation more relevant than the overlap one, because it is more consistent with the goals of WAV2MIDI systems that are being developed.

- Fixing the maximum polyphony before the submission is a good idea. 5-voices polyphony seems to be difficult enough to provide some challenge.

- If the instruments are not restricted to the aforementioned subset, what F0_min and F0_max can be expected?

mert bay

Hi Pierre, it is great that we are getting more participants. To answer your questions:

- Only the third task, the "timbre tracking" part, will be restricted to the subset you listed, to make it more feasible. Participants can use the training set, which includes the solo instrument recordings, to train their algorithms on specific instruments. For the frame-level and note-level evaluations, we'll have different music. I agree it is a good idea to have a polyphonic piano subtask, since there is a lot of work on piano transcription using NMF-like methods. Also, we'll get piano recordings from Graham, recorded using a Disklavier playback piano, so the ground truth is ready. I am not sure about monophonic pitch detection, since most monophonic pitch detection systems are already very accurate. However, as a MIREX rule, if there are more than 3 participants interested in that, we can run it.

- The second task is going to be evaluated symbolically. I'll update the evaluation part soon.

- I think we should have a poll about this among the participants. I suggest the algorithms take the maximum polyphony as input and do their processing according to that. At the end we will report statistics for different polyphony levels (from 2 to 5) for each algorithm.

- 60 Hz to 2 kHz.

Emmanuel Vincent

Hi all! I agree with Graham's comments about evaluation issues.

Our system is based on NMF, so it would be great if the test excerpts were somewhat long (the 30 s duration mentioned above is perfect).

Also, I am not sure about the correspondence between time instants and pitch annotations in the development data:

  • is the first 10ms time frame centered at t=0s (i.e. the first annotation corresponds to t=0s and the last to t=53.99s)?
  • or does the first 10ms time frame start at t=0s (i.e. the first annotation corresponds to t=0.005s and the last to 53.995s)?

Please also mention the maximum computation time allowed.

Mert Bay

Hi Emmanuel. Actually, the window size that generated the ground truth was 46 ms, so the first windows are not centered at 0 and 10 ms; the centers are at 23 ms, 33 ms, 43 ms, and so on. Since there is some silence at the beginning and end, there shouldn't be any difference. You can use whatever window size you like; the lowest F0 in the data will be 55 Hz. The maximum computation time allowed will be 30 hours for everything.

Emmanuel Vincent

Two remarks about the evaluation measures:

1) I suggest that the correct onset detection threshold be defined as +/- 50 ms instead of +/- 25 ms. Indeed a +/- 50 ms threshold was already chosen for the MIREX 2006 onset detection task. There are two reasons for this:

  • string instruments have weak onsets that cannot be defined precisely
  • the 46 ms window used for ground truth transcription induces a low precision on ground truth onsets on the order of +/- 23 ms

2) I agree with the principle of the 20% note duration threshold for correct offset detection. However this threshold will be smaller than the one used for onsets for notes shorter than 250 ms (which happens quite often). I suggest that this threshold be replaced by the maximum of 20% note duration and 50 ms.
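Combining the two suggestions, a note could then be scored roughly as in the sketch below (Python; the quarter-tone pitch check is my assumption, carried over from the frame-level rule):

 import math
 def note_correct(est, ref, onset_tol=0.05, offset_frac=0.2, cents_thresh=50.0):
     """est and ref are (onset_sec, offset_sec, f0_hz) tuples. The onset must
     match within +/- 50 ms; the offset within max(20% of the reference note
     duration, 50 ms), as suggested above."""
     est_on, est_off, est_f0 = est
     ref_on, ref_off, ref_f0 = ref
     if abs(1200.0 * math.log2(est_f0 / ref_f0)) > cents_thresh:
         return False
     if abs(est_on - ref_on) > onset_tol:
         return False
     offset_tol = max(offset_frac * (ref_off - ref_on), onset_tol)
     return abs(est_off - ref_off) <= offset_tol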

Andreas Ehmann

We did a quick experiment by running an f0 estimator at a 10 ms hopsize and at an 11.61 ms hopsize (512 samples at 44.1 kHz). The 11.61 ms hopsize result was then resampled by choosing the f0 at the closest time stamp to the 10 ms grid (e.g. the 92.88 ms frame in the original would be applied to 90 ms in the new resampled output). A figure of both f0 contours is shown below. The result was then evaluated using the melody extraction evaluator. Of the 493 frames in the recording, only 1 was evaluated as incorrect (99.8% accuracy).

[Figure F0s.jpg: F0 reported every 10 ms (441 samples at 44.1 kHz) vs. F0 reported every 11.61 ms (512 samples), resampled by picking the closest time stamp to the 10 ms grid.]

Resampling from a 23 ms hopsize resulted in roughly a 5% error, with virtually all of the error occurring in the legato/glissando regions of the f0 contours. It is therefore suggested to stick to hopsizes relatively close to 10 ms.
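For reference, the nearest-time-stamp resampling used in this experiment can be written in a few lines (NumPy; the array layout is assumed):

 import numpy as np
 def resample_to_grid(times, f0s, grid_step=0.010):
     """Resample an f0 contour onto a uniform grid (default 10 ms) by taking
     the value at the closest original time stamp, e.g. the 92.88 ms frame
     is mapped to the 90 ms grid point."""
     times = np.asarray(times)
     f0s = np.asarray(f0s)
     grid = np.arange(0.0, times[-1] + grid_step, grid_step)
     nearest = np.abs(times[None, :] - grid[:, None]).argmin(axis=1)
     return grid, f0s[nearest]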

Emmanuel Vincent

Thanks Andreas for this experiment!

It seems that the "glissando regions of the f0 contours" are due to inaccurate ground truth transcription. They are not related to actual glissando notes, but to temporal overlap between the decay of one note and the attack of the subsequent note.

Hence this experiment does not clearly prove that a 23 ms hopsize yields a 5% error rate, since this amount of error is likely not significant.