2024:Polyphonic Transcription

1 Introduction
2 Task Description
3 Note Duration Extension according to Pedal
4 Dataset
5 Metrics
- 5.1 Activation-level (frame-level) metrics
- 5.2 Event-level metrics
6 Additional procedure for mitigating systematic biases in specific evaluation datasets
- 6.1 Correcting alignment errors
7 Submission
8 Questions or Suggestions
9 References

Introduction

We are introducing polyphonic transcription as a MIREX task for the first time. This year's focus is on piano transcription (audio-to-MIDI). The task is designed to be compatible with most published papers on the topic (Hawthorne et al. 2019; Kong et al. 2020; Toyama et al. 2023; Yan et al. 2021, 2024), with the aim of standardizing evaluation in the field.

Task Description

The goal of piano transcription is to convert solo piano recordings into their symbolic representation, specifically the MIDI format. Participants must develop systems that can:

Extract notes with correct onset, offset, pitch and MIDI velocity
Extract sustain pedal (CC64) events, which are crucial for transcribing expressive performances (sostenuto and soft pedals are excluded for simplicity, as they are relatively rare in the MAESTRO dataset and less commonly reported in published papers)

For simplicity, sustain pedal events are binarized: CC values ≥ 64 represent 'on', and values < 64 represent 'off'. Participants' systems must output standard MIDI (.mid) files. This can be achieved by converting a list of (onset, offset, pitch, velocity) tuples using libraries such as pretty_midi [1].
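The binarization and conversion described above can be sketched as follows. This is a minimal illustration, assuming the pretty_midi library is installed; the helper names `binarize_pedal` and `write_midi`, the tuple layouts, and the choice to write 'on' as 127 and 'off' as 0 are assumptions made for the example, not part of the task definition.

```python
def binarize_pedal(cc_events, threshold=64):
    """Collapse (time, value) CC64 events into (time, is_on) state changes.

    Values >= threshold count as 'on', values < threshold as 'off'.
    Only changes of state are kept; the pedal is assumed released at the start.
    """
    changes = []
    state = False
    for time, value in sorted(cc_events):
        is_on = value >= threshold
        if is_on != state:
            changes.append((time, is_on))
            state = is_on
    return changes


def write_midi(notes, pedal_changes, path):
    """Write notes and binarized pedal changes to a standard MIDI file.

    notes: iterable of (onset_sec, offset_sec, pitch, velocity) tuples.
    pedal_changes: list of (time_sec, is_on) pairs, e.g. from binarize_pedal.
    """
    import pretty_midi  # third-party dependency, imported only when writing

    midi = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # program 0: acoustic grand piano
    for onset, offset, pitch, velocity in notes:
        piano.notes.append(
            pretty_midi.Note(velocity=int(velocity), pitch=int(pitch),
                             start=float(onset), end=float(offset)))
    for time, is_on in pedal_changes:
        piano.control_changes.append(
            pretty_midi.ControlChange(number=64, value=127 if is_on else 0,
                                      time=float(time)))
    midi.instruments.append(piano)
    midi.write(path)
```

For example, `write_midi([(0.0, 0.5, 60, 80)], binarize_pedal([(0.1, 90), (0.8, 0)]), "out.mid")` would write one note and one pedal press/release pair.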

Note Duration Extension according to Pedal

Most published papers on audio-to-MIDI automatic piano transcription follow this convention: when a note's offset occurs while the sustain pedal is active, the note's offset is extended to either the pedal release or the onset of the next note of the same pitch, whichever comes first.
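The extension rule can be written down in a few lines. The sketch below assumes notes are given as (onset, offset, pitch, velocity) tuples in seconds and that the binarized pedal is given as (press, release) intervals; the function name `extend_notes_with_pedal` is hypothetical, and edge-case handling (for example, notes that already overlap a later note of the same pitch) may differ between published implementations.

```python
def extend_notes_with_pedal(notes, pedal_intervals):
    """Extend note offsets that fall inside a sustain-pedal interval.

    notes: list of (onset, offset, pitch, velocity), times in seconds.
    pedal_intervals: list of (press_time, release_time), pedal held in between.
    Returns a new list of notes sorted by (onset, pitch).
    """
    notes = sorted(notes, key=lambda n: (n[0], n[2]))
    extended = []
    for onset, offset, pitch, velocity in notes:
        new_offset = offset
        # Is the pedal held down at the moment the key is released?
        for press, release in pedal_intervals:
            if press <= offset < release:
                new_offset = release
                break
        if new_offset > offset:
            # Stop early if the same pitch is struck again before the release.
            later_onsets = [n[0] for n in notes if n[2] == pitch and n[0] >= offset]
            if later_onsets:
                new_offset = min(new_offset, min(later_onsets))
        extended.append((onset, new_offset, pitch, velocity))
    return extended
```

The quadratic scan over same-pitch notes keeps the example short; a per-pitch index would be the natural choice for long recordings.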

Traditionally, these extended notes are used for both training and evaluation. We will use pedal-extended notes for evaluation to maintain consistency with existing literature. However, we recognize that note duration extension may be redundant if sustain pedals are accurately transcribed.

To accommodate various approaches, we allow participants to train their systems on either pedal-extended or non-extended notes.

Submissions can follow one of these formats:

Transcribe pedal-extended notes and sustain pedal events
Transcribe pedal-extended notes only (only note-related metrics will be computed)
Transcribe raw MIDI notes and sustain pedal events

During evaluation, notes from all three types of submissions will be comparable, as transcribed pedal information will be used to extend the transcribed notes when necessary.

Dataset

Training Data

Participants must train their models using the official training split of the MAESTRO dataset [2].
The validation split of MAESTRO can be used for hyperparameter tuning and model selection.
Data augmentation is allowed during training. This differs from many published papers but is permitted as it has been shown to improve generalization to datasets beyond MAESTRO.

Evaluation Data

Evaluation will be performed on the following datasets:

MAESTRO test split
Acoustic recordings from MAPS dataset [3]:
- ENSTDkCl/MUS subset
- ENSTDkAm/MUS subset
SMD-piano dataset (version 0, version 1)

Note: The final composition of the evaluation datasets is subject to change. More datasets could be included prior to the submission deadline.

Data Usage Guidelines

Only the MAESTRO training split should be used for model training.
Participants must not use any part of the evaluation datasets (MAESTRO test split, MAPS, or SMD-piano) for training or tuning their models.
The MAESTRO validation split may be used for development purposes, but the final evaluation will be on the specified test datasets.

Metrics

We follow the standard piano transcription evaluation procedure, as in (Hawthorne et al. 2018):

Activation-level (frame-level) metrics

Precision, recall, and F1 measures
These metrics assess how accurately the predicted pitch activations match the ground truth.
Instead of traditional time-axis discretization, we compute these metrics in continuous time to avoid bias against specific model hop sizes. This is equivalent to computing frame-level metrics with an infinitesimal hop size.
This continuous-time version was first introduced in (Yan et al. 2021).
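One way to realize such continuous-time metrics is to measure, per pitch, the total time during which reference and estimated activations overlap. The sketch below follows that idea under the assumption that notes are (onset, offset, pitch) tuples in seconds; the function names are hypothetical and this is not necessarily the organizers' exact implementation.

```python
from collections import defaultdict


def _merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def _overlap(a, b):
    """Total overlap duration between two sorted disjoint interval lists."""
    total, i, j = 0.0, 0, 0
    while i < len(a) and j < len(b):
        total += max(0.0, min(a[i][1], b[j][1]) - max(a[i][0], b[j][0]))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def continuous_frame_metrics(ref_notes, est_notes):
    """Return (precision, recall, f1) measured in seconds of active pitch time."""
    def by_pitch(notes):
        groups = defaultdict(list)
        for onset, offset, pitch in notes:
            groups[pitch].append((onset, offset))
        return {pitch: _merge(ivs) for pitch, ivs in groups.items()}

    ref, est = by_pitch(ref_notes), by_pitch(est_notes)
    ref_total = sum(end - start for ivs in ref.values() for start, end in ivs)
    est_total = sum(end - start for ivs in est.values() for start, end in ivs)
    hit = sum(_overlap(ref[p], est[p]) for p in ref.keys() & est.keys())
    precision = hit / est_total if est_total > 0 else 0.0
    recall = hit / ref_total if ref_total > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```

Because durations are integrated directly, the result does not depend on any frame rate, which is the property the continuous-time formulation is meant to provide.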

Event-level metrics

Using the mir_eval library [4], we compute precision, recall, and F1 measures for three sub-levels:

Note: A note prediction is considered correct if its estimated onset is within 50 ms of the corresponding ground truth onset
Note with offset: In addition to the onset criterion, the offset of the estimated note must be within 50 ms or 20% of the note duration from the ground truth offset, whichever is greater
Note with offset and velocity: In addition to the onset and offset criteria, the normalized estimated MIDI velocity must be within 0.1 of the normalized ground truth velocity
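The three sub-levels map onto existing mir_eval functions. The sketch below assumes mir_eval and NumPy are installed and that notes are (onset, offset, pitch, velocity) tuples; the wrapper `event_metrics` and the helper `offset_tolerance` are hypothetical names, and the 50-cent pitch tolerance is mir_eval's default rather than something stated in this task description.

```python
def offset_tolerance(ref_onset, ref_offset, ratio=0.2, min_tol=0.05):
    """Allowed offset deviation in seconds: the greater of 50 ms and
    20% of the reference note duration."""
    return max(min_tol, ratio * (ref_offset - ref_onset))


def event_metrics(ref_notes, est_notes):
    """Compute (precision, recall, f1) for the three event-level sub-levels."""
    import numpy as np
    import mir_eval.transcription
    import mir_eval.transcription_velocity
    import mir_eval.util

    def unpack(notes):
        intervals = np.array([[n[0], n[1]] for n in notes], dtype=float)
        # mir_eval expects pitches in Hz rather than MIDI numbers.
        pitches = mir_eval.util.midi_to_hz(
            np.array([n[2] for n in notes], dtype=float))
        velocities = np.array([n[3] for n in notes], dtype=float)
        return intervals, pitches, velocities

    ref_i, ref_p, ref_v = unpack(ref_notes)
    est_i, est_p, est_v = unpack(est_notes)
    common = dict(onset_tolerance=0.05, pitch_tolerance=50.0)

    # offset_ratio=None tells mir_eval to ignore offsets entirely.
    note = mir_eval.transcription.precision_recall_f1_overlap(
        ref_i, ref_p, est_i, est_p, offset_ratio=None, **common)
    note_off = mir_eval.transcription.precision_recall_f1_overlap(
        ref_i, ref_p, est_i, est_p,
        offset_ratio=0.2, offset_min_tolerance=0.05, **common)
    note_off_vel = mir_eval.transcription_velocity.precision_recall_f1_overlap(
        ref_i, ref_p, ref_v, est_i, est_p, est_v,
        offset_ratio=0.2, offset_min_tolerance=0.05,
        velocity_tolerance=0.1, **common)

    # Each call returns (precision, recall, f1, average overlap ratio).
    return {
        "note": note[:3],
        "note_with_offset": note_off[:3],
        "note_with_offset_and_velocity": note_off_vel[:3],
    }
```

As a worked example of the offset rule, a 100 ms reference note gets the 50 ms floor, whereas a 1 s reference note allows up to 200 ms of offset deviation.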

In addition to these classification-based metrics, we also compute the mean onset and offset deviations (in milliseconds) between matched predicted and ground truth notes.
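A minimal sketch of the deviation statistic, assuming the matching is available as a list of (reference index, estimate index) pairs, for instance as returned by `mir_eval.transcription.match_notes`. Whether deviations are averaged as absolute or signed values is not specified here; the example uses absolute values as an assumption, and the function name is hypothetical.

```python
def mean_deviations_ms(ref_intervals, est_intervals, matching):
    """Mean absolute onset and offset deviation in milliseconds.

    ref_intervals, est_intervals: sequences of (onset, offset) in seconds.
    matching: list of (ref_index, est_index) pairs of matched notes.
    Returns (mean_onset_dev_ms, mean_offset_dev_ms), or None if nothing matched.
    """
    if not matching:
        return None
    onset_devs = [abs(est_intervals[j][0] - ref_intervals[i][0])
                  for i, j in matching]
    offset_devs = [abs(est_intervals[j][1] - ref_intervals[i][1])
                   for i, j in matching]
    n = len(matching)
    return 1000.0 * sum(onset_devs) / n, 1000.0 * sum(offset_devs) / n
```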

Additional procedure for mitigating systematic biases in specific evaluation datasets

Correcting alignment errors

As demonstrated in (Yan et al. 2024), a systematic alignment error can change the conclusions drawn when comparing different models. For example, the MAPS dataset contains a piece-dependent delay of approximately 15 ms, and the SMD dataset (version 0) exhibits significant latency introduced by MP3 encoding/decoding.

To mitigate these issues:

MAPS: We apply a piece-specific alignment correction before evaluation.
SMD: We use an updated version of the dataset with corrected alignments.
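The exact correction procedure is not described on this page. Purely as an illustration of what a piece-specific correction could look like, the sketch below estimates a constant per-piece shift as the median signed onset deviation over matched notes and applies it to the annotation. The estimator, the function names, and the use of a single constant shift per piece are all assumptions, not the method of (Yan et al. 2024).

```python
from statistics import median


def estimate_shift(ref_onsets, est_onsets, matching):
    """Median signed onset deviation (estimate minus reference), in seconds.

    matching: list of (ref_index, est_index) pairs of matched notes.
    """
    return median([est_onsets[j] - ref_onsets[i] for i, j in matching])


def shift_notes(notes, shift):
    """Shift the onset and offset of every note by a constant number of seconds.

    notes: list of tuples whose first two fields are onset and offset.
    """
    return [(onset + shift, offset + shift, *rest)
            for onset, offset, *rest in notes]
```

A median is used rather than a mean so that a handful of badly matched notes does not dominate the estimated shift.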

Submission

TBD

Questions or Suggestions

Contact:

Yujia Yan: yujia.yan<at>rochester.edu
Ziyu Wang: ziyu.wang<at>nyu.edu

References

Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, Douglas Eck. "Onsets and Frames: Dual-Objective Piano Transcription". In: Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR). 2018.
Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset". In: Proceedings of the International Conference on Learning Representations (ICLR). 2019.
Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang. "High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2020), pp. 3707-3717.
Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Weimin Liao, Yuki Mitsufuji. "Automatic Piano Transcription with Hierarchical Frequency-Time Transformer". In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). 2023.
Yujia Yan, Frank Cwitkowitz, Zhiyao Duan. "Skipping the Frame-Level: Event-Based Piano Transcription With Neural Semi-CRFs". In: Advances in Neural Information Processing Systems. 2021.
Yujia Yan, Zhiyao Duan. "Scoring Time Intervals Using Non-Hierarchical Transformer for Automatic Piano Transcription". (To Appear) In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). 2024.
