Difference between revisions of "2024:Polyphonic Transcription"
(→Additional procedure for Mitigating systematic biases in specific evaluation datasets) |
(→Submission) |
||
(39 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
== Introduction == | == Introduction == | ||
− | We are introducing polyphonic transcription as a MIREX task for the first time. This year's focus is on '''piano transcription''' (audio-to-MIDI). For piano transcription, our task is designed to be compatible with most published papers on the topic ( | + | We are introducing polyphonic transcription as a MIREX task for the first time. This year's focus is on '''piano transcription''' (audio-to-MIDI). For piano transcription, our task is designed to be compatible with most published papers on the topic (e.g., [[#References| papers in the reference list]]), aiming to standardize evaluation in the field, fixing the pervasive error in published papers of using evaluation datasets beyond MAESTRO. (i.e., [[#Additional procedure for mitigating systematic biases in specific evaluation datasets| fixing the alignment bias]]). |
− | |||
− | |||
== Task Description == | == Task Description == | ||
Line 15: | Line 13: | ||
'''Participants' systems must output standard MIDI format (.mid) files'''. This can be achieved by converting a list of (onset, offset, pitch, velocity) tuples using libraries such as prettyMIDI [https://github.com/craffel/pretty-midi]. | '''Participants' systems must output standard MIDI format (.mid) files'''. This can be achieved by converting a list of (onset, offset, pitch, velocity) tuples using libraries such as prettyMIDI [https://github.com/craffel/pretty-midi]. | ||
− | == Note | + | == Note duration extension according to pedals == |
− | Most published papers on audio-to-MIDI automatic piano transcription follow | + | Most published papers on audio-to-MIDI automatic piano transcription follow the following convention: '''when a note's offset occurs while the sustain pedal is active, the note's offset is extended to either the pedal release or the onset of the next note of the same pitch, whichever comes first.''' |
+ | For example, if a C4 note ends at 2.0 seconds but the sustain pedal is active until 3.0 seconds, the note's duration would be extended to 3.0 seconds (or earlier if another C4 note starts before 3.0 seconds). | ||
+ | |||
Traditionally, these extended notes are used for both training and evaluation. We will use pedal-extended notes for evaluation to maintain consistency with existing literature. However, we recognize that note duration extension may be redundant if sustain pedals are accurately transcribed. | Traditionally, these extended notes are used for both training and evaluation. We will use pedal-extended notes for evaluation to maintain consistency with existing literature. However, we recognize that note duration extension may be redundant if sustain pedals are accurately transcribed. | ||
Line 28: | Line 28: | ||
* Transcribe raw MIDI notes and sustain pedal events | * Transcribe raw MIDI notes and sustain pedal events | ||
During evaluation, notes from all three types of submissions will be comparable, as transcribed pedal information will be used to extend the transcribed notes when necessary. | During evaluation, notes from all three types of submissions will be comparable, as transcribed pedal information will be used to extend the transcribed notes when necessary. | ||
+ | |||
+ | Additionally, we will compute note metrics only for notes without pedal extension, if any submission falls into the following two categories: | ||
+ | * Transcribe raw MIDI notes and sustain pedal events | ||
+ | * Transcribe raw MIDI notes only | ||
== Dataset == | == Dataset == | ||
=== Training Data === | === Training Data === | ||
− | * Participants must train their models using the '''official training split of the MAESTRO dataset [https://magenta.tensorflow.org/datasets/maestro]'''. | + | * Participants must train their models using the '''official training split of the MAESTRO v3.0.0 dataset [https://magenta.tensorflow.org/datasets/maestro]'''. |
− | * The validation split of MAESTRO can be used for hyperparameter tuning and model selection. | + | * The validation split of MAESTRO v3.0.0 can be used for hyperparameter tuning and model selection. |
* '''Data augmentation is allowed''' during training. This differs from many published papers but is permitted as it has been shown to improve generalization to datasets beyond MAESTRO. | * '''Data augmentation is allowed''' during training. This differs from many published papers but is permitted as it has been shown to improve generalization to datasets beyond MAESTRO. | ||
Line 39: | Line 43: | ||
Evaluation will be performed on the following datasets: | Evaluation will be performed on the following datasets: | ||
− | # MAESTRO test split | + | # MAESTRO v3.0.0 test split (177 audio files) |
− | # Acoustic recordings from MAPS dataset [https://adasp.telecom-paris.fr/resources/2010-07-08-maps-database/]: | + | # Acoustic recordings from MAPS dataset [https://adasp.telecom-paris.fr/resources/2010-07-08-maps-database/] (60 audio files): |
#* ENSTDkCl/MUS subset | #* ENSTDkCl/MUS subset | ||
#* ENSTDkAm/MUS subset | #* ENSTDkAm/MUS subset | ||
− | # SMD-piano dataset | + | # SMD-piano dataset [https://zenodo.org/records/13753319 version 2] (50 audio files) |
− | |||
− | |||
=== Data Usage Guidelines === | === Data Usage Guidelines === | ||
Line 58: | Line 60: | ||
We follow the standard piano transcription evaluation procedure, as in (Hawthorne et al. 2018): | We follow the standard piano transcription evaluation procedure, as in (Hawthorne et al. 2018): | ||
=== Activation-level (frame-level) metrics === | === Activation-level (frame-level) metrics === | ||
− | * Precision, recall, and F1 measures | + | * We compute Precision, recall, and F1 measures |
* These metrics assess how accurately the predicted pitch activations match the ground truth. | * These metrics assess how accurately the predicted pitch activations match the ground truth. | ||
* Instead of traditional time-axis discretization, we compute these metrics in continuous time to avoid bias against specific model hop sizes. It is equivalent to frame-level metrics with an infinitesimal hop size. | * Instead of traditional time-axis discretization, we compute these metrics in continuous time to avoid bias against specific model hop sizes. It is equivalent to frame-level metrics with an infinitesimal hop size. | ||
Line 67: | Line 69: | ||
# '''Note''': A note prediction is considered correct if its estimated onset is within 50ms of the corresponding ground truth onset | # '''Note''': A note prediction is considered correct if its estimated onset is within 50ms of the corresponding ground truth onset | ||
# '''Note with offset''': In addition to the onset criterion, the offset of the estimated note must be within 50ms or 20% of the note duration from the ground truth offset, whichever is greater | # '''Note with offset''': In addition to the onset criterion, the offset of the estimated note must be within 50ms or 20% of the note duration from the ground truth offset, whichever is greater | ||
− | # '''Note with offset and velocity''': In addition to onset and offset criteria, the normalized estimated midi velocity must be within | + | # '''Note with offset and velocity''': In addition to onset and offset criteria, the normalized estimated midi velocity must be within 10% of the ground truth velocity. This normalized estimated velocity is described in (Hawthorne et al 2018) that uses linear regression to rescale and shift the estimated velocity to match the groudtruth. |
In addition to these classification based metrics, we also compute the mean onset/offset deviations (in milliseconds) for matched notes between predicted and ground truth notes. | In addition to these classification based metrics, we also compute the mean onset/offset deviations (in milliseconds) for matched notes between predicted and ground truth notes. | ||
Line 75: | Line 77: | ||
=== Correcting alignment errors === | === Correcting alignment errors === | ||
− | As demonstrated in (Yan et al. 2024), | + | As demonstrated in (Yan et al. 2024), systematic alignment errors in MAPS may change the conclusion of comparing different models. The existence of certain systematic alignment errors may assign lower score for a model with higher accuracy and timing precision. |
− | + | These biases, often resulting from recording setup or data processing methods, can lead to systematic errors in note timings. Correcting these biases ensures a fair comparison between different transcription systems. | |
− | To | + | To mitigate the alignment issue in the evaluation datasets: |
− | # | + | # MAPS: We apply a piece-specific alignment correction before evaluation. |
− | # | + | # SMD: We use an updated version of the dataset with corrected alignments (Version 2). |
== Submission == | == Submission == | ||
− | + | To simplify the submission process, participants are only required to submit their transcription results in MIDI format (.mid). | |
+ | Submission Guidelines: | ||
+ | |||
+ | # Transcribe each audio file in the evaluation set to a corresponding MIDI file. | ||
+ | # Name each MIDI file exactly the same as the original audio file, but replace the audio extension with .mid. | ||
+ | # Organize your submission in folders: | ||
+ | #* Create a folder for each dataset: "maestro", "MAPS", and "SMD" | ||
+ | #* Place all MIDI files for each dataset in its respective folder | ||
+ | # If your system does not output notes with extended duration according to pedals: | ||
+ | #* Add "_no_ext" to the folder names | ||
+ | #* Example: "maestro_no_ext", "MAPS_no_ext", "SMD_no_ext" | ||
+ | # Similarly, if you system does not output pedals, add postfix "_no_pedal" to the folder names | ||
+ | # Include an extended abstract for briefly describe the design of your system. Format: PDF of 2-4 pages. '''If it comes from a published paper, you can choose to include that paper directly''' | ||
+ | # Put all folders into a single .zip file, name it with the identifier you want for your system. | ||
+ | |||
+ | |||
+ | |||
+ | Your final submission structure should look like this: | ||
+ | |||
+ | <pre> | ||
+ | teamA.zip/ | ||
+ | ├── maestro/ (or maestro_no_ext/) | ||
+ | │ ├── MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_08_Track08_wav.mid | ||
+ | │ ├── MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_10_Track10_wav.mid | ||
+ | │ └── ... | ||
+ | ├── MAPS/ (or MAPS_no_ext/) | ||
+ | │ ├── MAPS_MUS-alb_se2_ENSTDkCl.mid | ||
+ | │ ├── MAPS_MUS-bk_xmas1_ENSTDkAm.mid | ||
+ | │ └── ... | ||
+ | ├── SMD/ (or SMD_no_ext/) | ||
+ | │ ├── Bach_BWV849-01_001_20090916-SMD.mid | ||
+ | │ ├── Bach_BWV849-02_001_20090916-SMD.mid | ||
+ | │ └── ... | ||
+ | └── teamA.pdf (extended abstract, 2-4 pages, or the published paper about the system) | ||
+ | </pre> | ||
+ | |||
+ | Important Notes: | ||
+ | |||
+ | * This is the first year for MIREX to restart the Polyphonic Transcription task. | ||
+ | * We welcome submissions from both new systems and published papers. | ||
+ | * Authors of published papers are encouraged to submit their results to help test the entire evaluation procedure. | ||
+ | * If submitting results from a published paper, you may include the full paper as your extended abstract. | ||
+ | * Your participation will help us refine and improve the evaluation process for future iterations of this task. | ||
+ | |||
+ | If you have any questions about the submission process or need assistance, please don't hesitate to contact the task organizers. | ||
== Questions or Suggestions == | == Questions or Suggestions == | ||
− | + | If you have any question, or you have difficulty accessing any dataset for evaluation for this task, contact: | |
* Yujia Yan: yujia.yan<at>rochester.edu | * Yujia Yan: yujia.yan<at>rochester.edu | ||
* Ziyu Wang: ziyu.wang<at>nyu.edu | * Ziyu Wang: ziyu.wang<at>nyu.edu | ||
− | |||
== References == | == References == | ||
Line 96: | Line 141: | ||
* Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset". In: ''Proceedings of the International Conference on Learning Representations (ICLR)''. 2019. | * Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset". In: ''Proceedings of the International Conference on Learning Representations (ICLR)''. 2019. | ||
* Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang. "High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times". In: ''IEEE/ACM Transactions on Audio, Speech, and Language Processing'' 29 (2020), pp. 3707-3717. | * Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang. "High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times". In: ''IEEE/ACM Transactions on Audio, Speech, and Language Processing'' 29 (2020), pp. 3707-3717. | ||
+ | * Weixing Wei, Peilin Li, Yi Yu, Wei Li. "HPPNet: Modeling the Harmonic Structure and Pitch Invariance in Piano Transcription". In: ''Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR)''. Bengaluru, India, 2022, pp. 709-716. | ||
* Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Weimin Liao, Yuki Mitsufuji. "Automatic Piano Transcription with Hierarchical Frequency-Time Transformer". In: ''Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)''. 2023. | * Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Weimin Liao, Yuki Mitsufuji. "Automatic Piano Transcription with Hierarchical Frequency-Time Transformer". In: ''Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)''. 2023. | ||
* Yujia Yan, Frank Cwitkowitz, Zhiyao Duan. "Skipping the Frame-Level: Event-Based Piano Transcription With Neural Semi-CRFs". In: ''Advances in Neural Information Processing Systems''. 2021. | * Yujia Yan, Frank Cwitkowitz, Zhiyao Duan. "Skipping the Frame-Level: Event-Based Piano Transcription With Neural Semi-CRFs". In: ''Advances in Neural Information Processing Systems''. 2021. | ||
* Yujia Yan, Zhiyao Duan. "Scoring Time Intervals Using Non-Hierarchical Transformer for Automatic Piano Transcription". (To Appear) In: ''Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)''. 2024. | * Yujia Yan, Zhiyao Duan. "Scoring Time Intervals Using Non-Hierarchical Transformer for Automatic Piano Transcription". (To Appear) In: ''Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)''. 2024. |
Latest revision as of 15:28, 4 October 2024
Introduction
We are introducing polyphonic transcription as a MIREX task for the first time. This year's focus is on piano transcription (audio-to-MIDI). The task is designed to be compatible with most published papers on the topic (e.g., the papers in the reference list) and aims to standardize evaluation in the field. It also addresses a pervasive problem in published papers that evaluate on datasets beyond MAESTRO, namely the alignment bias in those datasets.
Task Description
The goal of piano transcription is to convert solo piano recordings into their symbolic representation, specifically the MIDI format. Participants must develop systems that can:
- Extract notes with correct onset, offset, pitch and MIDI velocity
- Extract sustain pedal (CC64) events, which are crucial for transcribing expressive performances (sostenuto and soft pedals are excluded for simplicity, as they are relatively rare in the MAESTRO dataset and less commonly reported in published papers)
For simplicity, sustain pedal events are binarized: CC values ≥ 64 represent 'on', and < 64 represent 'off'. Participants' systems must output standard MIDI format (.mid) files. This can be achieved by converting a list of (onset, offset, pitch, velocity) tuples using libraries such as prettyMIDI [1].
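The binarization rule above can be sketched in a few lines. This is a minimal illustration, not the organizers' code: the function name `pedal_intervals`, the `(time, value)` event format, and the `end_time` argument used to close a pedal that is never released are all assumptions made for the example.

```python
from typing import List, Tuple


def pedal_intervals(cc_events: List[Tuple[float, int]],
                    end_time: float) -> List[Tuple[float, float]]:
    """Convert (time, CC64 value) events into (press, release) intervals.

    A value >= 64 counts as 'on' and a value < 64 counts as 'off'.
    """
    intervals = []
    press_time = None  # time of the current press, or None if the pedal is up
    for time, value in sorted(cc_events):
        is_on = value >= 64
        if is_on and press_time is None:
            press_time = time  # off -> on transition
        elif not is_on and press_time is not None:
            intervals.append((press_time, time))  # on -> off transition
            press_time = None
    if press_time is not None:
        # Assumption: a pedal still held at the end is released at end_time.
        intervals.append((press_time, end_time))
    return intervals


events = [(0.5, 30), (1.0, 64), (1.2, 127), (2.0, 63), (2.5, 90)]
print(pedal_intervals(events, end_time=4.0))  # [(1.0, 2.0), (2.5, 4.0)]
```

Repeated 'on' values (such as 64 followed by 127) do not open a new interval; only transitions across the threshold matter.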
Note duration extension according to pedals
Most published papers on audio-to-MIDI automatic piano transcription adopt the following convention: when a note's offset occurs while the sustain pedal is active, the note's offset is extended to either the pedal release or the onset of the next note of the same pitch, whichever comes first. For example, if a C4 note ends at 2.0 seconds but the sustain pedal is active until 3.0 seconds, the note's offset would be extended to 3.0 seconds (or earlier if another C4 note starts before 3.0 seconds).
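The extension convention can be sketched as follows. This is an illustrative reading of the rule, not the official evaluation code: the function name `extend_notes` and the tuple layouts are assumptions, and "next note of the same pitch" is interpreted here as the first same-pitch onset at or after the original offset.

```python
from typing import List, Tuple

Note = Tuple[float, float, int, int]   # (onset, offset, pitch, velocity)
Pedal = Tuple[float, float]            # (press, release)


def extend_notes(notes: List[Note], pedals: List[Pedal]) -> List[Note]:
    """Extend note offsets according to sustain pedal intervals."""
    extended = []
    for onset, offset, pitch, velocity in notes:
        new_offset = offset
        # If the offset falls inside a pedal interval, hold until release.
        for press, release in pedals:
            if press <= offset < release:
                new_offset = release
                break
        # Cap at the next onset of the same pitch (assumed: first onset
        # at or after the original offset), whichever comes first.
        later_onsets = [o for (o, _, p, _) in notes
                        if p == pitch and o >= offset]
        if later_onsets:
            new_offset = min(new_offset, min(later_onsets))
        extended.append((onset, new_offset, pitch, velocity))
    return extended


# C4 (MIDI pitch 60) ends at 2.0 s while the pedal is held until 3.0 s,
# but another C4 starts at 2.5 s, so the first note is cut at 2.5 s.
notes = [(1.0, 2.0, 60, 80), (2.5, 2.8, 60, 70), (1.5, 2.2, 64, 90)]
pedals = [(1.8, 3.0)]
print(extend_notes(notes, pedals))
# [(1.0, 2.5, 60, 80), (2.5, 3.0, 60, 70), (1.5, 3.0, 64, 90)]
```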
Traditionally, these extended notes are used for both training and evaluation. We will use pedal-extended notes for evaluation to maintain consistency with existing literature. However, we recognize that note duration extension may be redundant if sustain pedals are accurately transcribed.
To accommodate various approaches, we allow participants to train their systems on either pedal-extended or non-extended notes.
Submissions can follow one of these formats:
- Transcribe pedal-extended notes and sustain pedal events
- Transcribe pedal-extended notes only (only note-related metrics will be computed)
- Transcribe raw MIDI notes and sustain pedal events
During evaluation, notes from all three types of submissions will be comparable, as transcribed pedal information will be used to extend the transcribed notes when necessary.
Additionally, for submissions that fall into either of the following two categories, we will also compute note metrics on notes without pedal extension:
- Transcribe raw MIDI notes and sustain pedal events
- Transcribe raw MIDI notes only
Dataset
Training Data
- Participants must train their models using the official training split of the MAESTRO v3.0.0 dataset [2].
- The validation split of MAESTRO v3.0.0 can be used for hyperparameter tuning and model selection.
- Data augmentation is allowed during training. This differs from many published papers but is permitted as it has been shown to improve generalization to datasets beyond MAESTRO.
Evaluation Data
Evaluation will be performed on the following datasets:
- MAESTRO v3.0.0 test split (177 audio files)
- Acoustic recordings from MAPS dataset [3] (60 audio files):
- ENSTDkCl/MUS subset
- ENSTDkAm/MUS subset
- SMD-piano dataset version 2 (50 audio files)
Data Usage Guidelines
- Only the MAESTRO training split should be used for model training.
- Participants must not use any part of the evaluation datasets (MAESTRO test split, MAPS, or SMD-piano) for training or tuning their models.
- The MAESTRO validation split may be used for development purposes, but the final evaluation will be on the specified test datasets.
Metrics
We follow the standard piano transcription evaluation procedure, as in (Hawthorne et al. 2018):
Activation-level (frame-level) metrics
- We compute precision, recall, and F1 measures.
- These metrics assess how accurately the predicted pitch activations match the ground truth.
- Instead of traditional time-axis discretization, we compute these metrics in continuous time to avoid bias against specific model hop sizes. It is equivalent to frame-level metrics with an infinitesimal hop size.
- This continuous-time version was first introduced in (Yan et al. 2021)
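One way to picture the continuous-time variant is to replace frame counts with durations: the true-positive mass is the total time during which a pitch is active in both the reference and the estimate. The sketch below follows that reading; it is not the implementation from (Yan et al. 2021), and the helper names `merge`, `overlap`, and `activation_prf` are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Interval = Tuple[float, float]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union of intervals, returned as sorted, non-overlapping intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap(a: List[Interval], b: List[Interval]) -> float:
    """Total intersection length of two sorted, non-overlapping lists."""
    total, i, j = 0.0, 0, 0
    while i < len(a) and j < len(b):
        total += max(0.0, min(a[i][1], b[j][1]) - max(a[i][0], b[j][0]))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def activation_prf(ref_notes, est_notes):
    """Precision, recall, F1 over active time; notes are (onset, offset, pitch)."""
    def by_pitch(notes):
        grouped = defaultdict(list)
        for onset, offset, pitch in notes:
            grouped[pitch].append((onset, offset))
        return {p: merge(iv) for p, iv in grouped.items()}

    ref, est = by_pitch(ref_notes), by_pitch(est_notes)
    ref_total = sum(e - s for iv in ref.values() for s, e in iv)
    est_total = sum(e - s for iv in est.values() for s, e in iv)
    tp = sum(overlap(ref[p], est[p]) for p in ref.keys() & est.keys())
    precision = tp / est_total if est_total > 0 else 0.0
    recall = tp / ref_total if ref_total > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


ref = [(0.0, 1.0, 60), (1.0, 2.0, 62)]
est = [(0.0, 0.5, 60), (1.0, 2.0, 62), (2.0, 2.5, 64)]
print(activation_prf(ref, est))  # 1.5 s shared out of 2.0 s on each side
```

Because durations are measured directly, the result does not depend on any model's hop size.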
Event-level metrics
Using the mir_eval library [4], we compute precision, recall, and F1 measures for three sub-levels:
- Note: A note prediction is considered correct if its estimated onset is within 50ms of the corresponding ground truth onset
- Note with offset: In addition to the onset criterion, the offset of the estimated note must be within 50ms or 20% of the note duration from the ground truth offset, whichever is greater
- Note with offset and velocity: In addition to the onset and offset criteria, the normalized estimated MIDI velocity must be within 10% of the ground truth velocity. The normalization, described in (Hawthorne et al. 2018), uses linear regression to rescale and shift the estimated velocities to match the ground truth.
In addition to these classification-based metrics, we also compute the mean onset and offset deviations (in milliseconds) between matched predicted and ground truth notes.
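The per-note criteria above can be illustrated for a single reference/estimate pair. This sketch only shows the tolerance checks; the actual evaluation uses mir_eval, which also solves the matching between reference and estimated notes. The function names are hypothetical, and two details are assumptions: the 20% is taken relative to the reference note duration, and reference velocities are rescaled to [0, 1] before the regression.

```python
from typing import List, Tuple


def onset_ok(ref_onset: float, est_onset: float, tol: float = 0.05) -> bool:
    """Onset criterion: within 50 ms of the reference onset."""
    return abs(est_onset - ref_onset) <= tol


def offset_ok(ref_onset: float, ref_offset: float, est_offset: float,
              ratio: float = 0.2, min_tol: float = 0.05) -> bool:
    """Offset criterion: within max(50 ms, 20% of reference duration)."""
    tol = max(min_tol, ratio * (ref_offset - ref_onset))
    return abs(est_offset - ref_offset) <= tol


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept mapping x onto y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return 0.0, mean_y
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / var_x
    return slope, mean_y - slope * mean_x


def velocity_hits(ref_vel: List[float], est_vel: List[float],
                  tol: float = 0.1) -> List[bool]:
    """Velocity criterion on already-matched note pairs."""
    lo, hi = min(ref_vel), max(ref_vel)
    # Assumption: reference velocities are rescaled to the range [0, 1].
    ref_norm = [(v - lo) / max(1.0, hi - lo) for v in ref_vel]
    slope, intercept = fit_line(est_vel, ref_norm)
    rescaled = [slope * v + intercept for v in est_vel]
    return [abs(e - r) <= tol for e, r in zip(rescaled, ref_norm)]


# A 2-second reference note tolerates a 0.4 s offset error;
# a 0.1-second note falls back to the 50 ms minimum.
print(offset_ok(1.0, 3.0, 3.3), offset_ok(1.0, 3.0, 3.5))
print(offset_ok(1.0, 1.1, 1.13), offset_ok(1.0, 1.1, 1.2))
# Estimated velocities that are a linear function of the reference all pass.
print(velocity_hits([40, 80, 120], [30, 50, 70]))
```

The regression step means a system is not penalized for a consistent scale or shift in its velocity output, only for deviations that a linear map cannot explain.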
Additional procedure for mitigating systematic biases in specific evaluation datasets
Correcting alignment errors
As demonstrated in (Yan et al. 2024), systematic alignment errors in MAPS can change the conclusions drawn when comparing models: a model with higher accuracy and timing precision may receive a lower score. These biases, which often result from the recording setup or data processing methods, shift note timings systematically. Correcting them ensures a fair comparison between transcription systems.
To mitigate the alignment issue in the evaluation datasets:
- MAPS: We apply a piece-specific alignment correction before evaluation.
- SMD: We use an updated version of the dataset with corrected alignments (Version 2).
Submission
To simplify the submission process, participants are only required to submit their transcription results in MIDI format (.mid).
Submission Guidelines:
- Transcribe each audio file in the evaluation set to a corresponding MIDI file.
- Name each MIDI file exactly the same as the original audio file, but replace the audio extension with .mid.
- Organize your submission in folders:
- Create a folder for each dataset: "maestro", "MAPS", and "SMD"
- Place all MIDI files for each dataset in its respective folder
- If your system does not output notes with extended duration according to pedals:
- Add "_no_ext" to the folder names
- Example: "maestro_no_ext", "MAPS_no_ext", "SMD_no_ext"
- Similarly, if your system does not output pedals, add the suffix "_no_pedal" to the folder names
- Include an extended abstract that briefly describes the design of your system. Format: PDF of 2-4 pages. If the system comes from a published paper, you may include that paper directly
- Put all folders into a single .zip file and name it with the identifier you want for your system.
Your final submission structure should look like this:
teamA.zip/
├── maestro/ (or maestro_no_ext/)
│   ├── MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_08_Track08_wav.mid
│   ├── MIDI-Unprocessed_SMF_02_R1_2004_01-05_ORIG_MID--AUDIO_02_R1_2004_10_Track10_wav.mid
│   └── ...
├── MAPS/ (or MAPS_no_ext/)
│   ├── MAPS_MUS-alb_se2_ENSTDkCl.mid
│   ├── MAPS_MUS-bk_xmas1_ENSTDkAm.mid
│   └── ...
├── SMD/ (or SMD_no_ext/)
│   ├── Bach_BWV849-01_001_20090916-SMD.mid
│   ├── Bach_BWV849-02_001_20090916-SMD.mid
│   └── ...
└── teamA.pdf (extended abstract, 2-4 pages, or the published paper about the system)
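The naming rules can be captured in two small helpers. This is only a sketch: `midi_name` and `folder_name` are hypothetical names, and the order in which the two suffixes are combined ("_no_ext" before "_no_pedal") is an assumption, since the guidelines do not specify it.

```python
from pathlib import PurePosixPath


def midi_name(audio_filename: str) -> str:
    """Keep the audio file's name but replace its extension with .mid."""
    return PurePosixPath(audio_filename).with_suffix(".mid").name


def folder_name(dataset: str, pedal_extended: bool, has_pedal: bool) -> str:
    """Build the dataset folder name with the required suffixes."""
    name = dataset
    if not pedal_extended:
        name += "_no_ext"
    if not has_pedal:
        # Assumption: when both suffixes apply, "_no_pedal" comes second.
        name += "_no_pedal"
    return name


print(midi_name("MAPS_MUS-alb_se2_ENSTDkCl.wav"))
# MAPS_MUS-alb_se2_ENSTDkCl.mid
print(folder_name("maestro", pedal_extended=False, has_pedal=True))
# maestro_no_ext
```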
Important Notes:
- This is the first year in which MIREX restarts the Polyphonic Transcription task.
- We welcome submissions from both new systems and published papers.
- Authors of published papers are encouraged to submit their results to help test the entire evaluation procedure.
- If submitting results from a published paper, you may include the full paper as your extended abstract.
- Your participation will help us refine and improve the evaluation process for future iterations of this task.
If you have any questions about the submission process or need assistance, please don't hesitate to contact the task organizers.
Questions or Suggestions
If you have any questions, or have difficulty accessing any of the evaluation datasets for this task, contact:
- Yujia Yan: yujia.yan<at>rochester.edu
- Ziyu Wang: ziyu.wang<at>nyu.edu
References
- Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, Douglas Eck. "Onsets and Frames: Dual-Objective Piano Transcription". In: Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR). 2018.
- Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset". In: Proceedings of the International Conference on Learning Representations (ICLR). 2019.
- Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang. "High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2020), pp. 3707-3717.
- Weixing Wei, Peilin Li, Yi Yu, Wei Li. "HPPNet: Modeling the Harmonic Structure and Pitch Invariance in Piano Transcription". In: Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR). Bengaluru, India, 2022, pp. 709-716.
- Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Weimin Liao, Yuki Mitsufuji. "Automatic Piano Transcription with Hierarchical Frequency-Time Transformer". In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). 2023.
- Yujia Yan, Frank Cwitkowitz, Zhiyao Duan. "Skipping the Frame-Level: Event-Based Piano Transcription With Neural Semi-CRFs". In: Advances in Neural Information Processing Systems. 2021.
- Yujia Yan, Zhiyao Duan. "Scoring Time Intervals Using Non-Hierarchical Transformer for Automatic Piano Transcription". (To Appear) In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). 2024.