2026:Audio-to-Score Transcription
Revision as of 03:09, 29 June 2026
Description
This page describes the MIREX 2026 Audio-to-Score Transcription challenge. The evaluation procedure and the submission format are described in the sections below.
Automatic Music Transcription (AMT) is the task of designing computational algorithms that convert acoustic music signals into a symbolic musical representation [2]. Existing MIREX tasks have evaluated several important AMT components, including melody extraction, drum detection, piano transcription, and multi-instrument transcription. However, these tasks have generally stopped at MIDI or note-event outputs, which are not equivalent to a well-formed musical score.
The Audio-to-Score Transcription (A2S) challenge focuses on this missing step. Participating systems will receive an audio recording of a polyphonic music piece and must output a digital symbolic score that can be read by a musician or notation program. The target output is a kern score encoding standard musical information such as pitches, durations, accidentals, staves, meter, clef, key signature, and bar structure.
The task can be informally expressed as follows:
Ŝ = argmax_S P(S | X)
where Ŝ is the predicted score, S ranges over candidate symbolic scores, and X is the input audio signal.
The challenge will include polyphonic recordings and may include multiple instruments. Pieces may be full recordings or shorter excerpts, such as 3 to 6 bars.
Two tracks will be considered:
- Staves-Informed A2S: staves metadata is provided and may be used to guide voice identification and instrument recognition.
- Blind A2S: only the audio recording is provided.
Participants may enter both tracks. Results will be reported separately.
Notice: We particularly encourage submissions that move beyond isolated AMT components and produce complete, valid symbolic scores.
Evaluation
Submissions will be evaluated with metrics that assess both the textual quality of the produced kern files and the musical transcription quality of the corresponding scores.
Score Quality
Character Error Rate (CER): the Levenshtein distance between the submitted and ground-truth kern files at the character level. This metric penalizes character insertions, deletions, and substitutions, and is sensitive to details such as pitch spelling and note durations.
Word Error Rate (WER): the Levenshtein distance computed on tab- and newline-separated strings in a kern file. WER is reported in recent A2S papers [1], [3]. Compared to CER, it evaluates the output closer to the note-token level, checking whether durations and pitches are jointly correct.
Line Error Rate (LER): the Levenshtein distance computed on complete lines of the output kern files. LER is stricter than CER and WER, and helps verify whether multiple instruments or voices are correctly transcribed and aligned at the beat and sub-beat level.
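The three text-level metrics above differ only in how the kern text is split before the edit distance is computed. The following is a minimal sketch, not the official evaluation code. It assumes each rate is the Levenshtein distance divided by the length of the ground-truth sequence (the usual convention for error rates), and the function names are illustrative.

```python
import re
from typing import Dict, List, Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the reference length (assumed convention)."""
    if len(ref) == 0:
        return float(len(hyp) > 0)
    return levenshtein(ref, hyp) / len(ref)


def tokens(text: str) -> List[str]:
    """Split on tabs and newlines only, since a kern token may contain spaces."""
    return [t for t in re.split(r"[\t\n]", text) if t]


def kern_error_rates(ref_text: str, hyp_text: str) -> Dict[str, float]:
    """CER on characters, WER on tab/newline-separated tokens, LER on whole lines."""
    return {
        "CER": error_rate(ref_text, hyp_text),
        "WER": error_rate(tokens(ref_text), tokens(hyp_text)),
        "LER": error_rate(ref_text.splitlines(), hyp_text.splitlines()),
    }
```

With a two-line reference and a prediction that gets a single pitch wrong, one character error becomes one wrong token and one wrong line, which is why LER is the strictest of the three.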
Musical Transcription Quality
In addition to CER, WER, and LER, this challenge will use the MV2H metric introduced in [4] and used in recent A2S work [1], [3]. MV2H is an F-score obtained as the arithmetic mean of five sub-metrics:
- Multi-pitch detection: pitches must be correct and detected at onsets within a 50 ms error threshold.
- Voice separation: notes belonging to the same voice or instrument should be grouped together.
- Metrical alignment: bars, beats, and sub-beats should be correctly identified.
- Note value detection: note durations should be correctly transcribed.
- Harmonic analysis: average of a key detection score and a chord-symbol recall.
The harmonic analysis component will not be used for the initial edition because no chord information is available in the considered datasets.
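As an illustration of how the aggregate could be formed once the harmonic component is dropped, the sketch below averages the four remaining sub-metrics with equal weight. This is an assumption: the page does not state how the evaluation pipeline combines them, and the function name is hypothetical.

```python
from statistics import mean


def mv2h_without_harmony(multi_pitch: float, voice: float,
                         meter: float, note_value: float) -> float:
    """Equal-weight mean of the four non-harmonic MV2H sub-metrics (assumed aggregation)."""
    scores = (multi_pitch, voice, meter, note_value)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each sub-metric is expected to lie in [0, 1]")
    return mean(scores)
```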
Public implementations of MV2H are available:
The evaluation pipeline will adapt these implementations to process kern files, building where appropriate on the public code from [1]:
Note that WER and MV2H will be retained as the main ranking criteria. CER and LER will be reported for more detailed comparisons.
In addition to these performance metrics, each submission will be evaluated in terms of memory use, number of operations, and computational time required to process the evaluation set.
Submission Format
Several submission formats will be accepted to accommodate different system designs. If you encounter any issue with the submission process, please contact the Task Captain.
General Guidelines
All submissions should be "plug-and-play", with a clear README detailing usage steps.
The recommended submission format is a Docker image or a code repository with a main bash or Python script to run.
Resources Declaration: All submissions must state:
- The training data size
- The number of parameters in the model, if applicable
- The amount of GPU/CPU hours used for training, with device information (model and VRAM)
- The inference time required for the evaluation set
We strongly recommend sharing the submitted algorithms or checkpoints under permissive open licenses, but any licensing (or even not sharing publicly) is accepted as long as it is clearly reported at the time of submission.
I / O
The submitted algorithm must take as input an audio file, or a folder containing audio files, and write one predicted **kern file for each input audio file to a specified output directory.
Input Audio
Participating algorithms will receive audio files in the following format:
- 16 bit FLAC
- 22 050 Hz
- 30s (for Quartets)
If your pipeline expects characteristics different from the ones described above, the conversion should be done automatically in your algorithm.
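One way to perform such a conversion inside a submission is to call ffmpeg before inference. The sketch below assumes ffmpeg is installed in the submission environment. The 16 kHz mono WAV target is a hypothetical example of what a model might expect, not a requirement of the task.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: Path, dst: Path,
                         sample_rate: int = 16000, mono: bool = True) -> List[str]:
    """Build an ffmpeg call that resamples src and writes it to dst."""
    cmd = ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate)]  # -ar: output sample rate
    if mono:
        cmd += ["-ac", "1"]  # -ac: number of output channels
    cmd.append(str(dst))
    return cmd


def convert(src: Path, dst: Path, **kwargs) -> None:
    """Run the conversion and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst, **kwargs), check=True)
```

Keeping the command construction separate from its execution makes the conversion step easy to log and to check without running ffmpeg.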
For the Blind A2S track, only the audio recording will be provided.
For the Staves-Informed A2S track, staves metadata will also be provided. This metadata is the **kern header, which looks like this in the Quartets dataset, for example:
**kern	**dynam	**kern	**dynam	**kern	**dynam	**kern	**dynam
*Icello	*Icello	*Iviola	*Iviola	*Ivioln	*Ivioln	*Iflt	*Iflt
*clefF4	*	*clefC3	*	*clefG2	*	*clefG2	*
*k[f#c#]	*	*k[f#c#]	*	*k[f#c#]	*	*k[f#c#]	*
*D:	*	*D:	*	*D:	*	*D:	*
*M4/4	*	*M4/4	*	*M4/4	*	*M4/4	*
*MM130	*	*MM130	*	*MM130	*	*MM130	*
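A staves-informed system needs to turn this header into per-staff information. The sketch below reads the example header, assuming tab-separated spines with one record per line, and collects the instrument, clef, key signature, key, meter, and tempo of each **kern spine. The function name and the dictionary keys are illustrative choices.

```python
from typing import Dict, List, Optional, Union

HEADER = "\n".join([
    "**kern\t**dynam\t**kern\t**dynam\t**kern\t**dynam\t**kern\t**dynam",
    "*Icello\t*Icello\t*Iviola\t*Iviola\t*Ivioln\t*Ivioln\t*Iflt\t*Iflt",
    "*clefF4\t*\t*clefC3\t*\t*clefG2\t*\t*clefG2\t*",
    "*k[f#c#]\t*\t*k[f#c#]\t*\t*k[f#c#]\t*\t*k[f#c#]\t*",
    "*D:\t*\t*D:\t*\t*D:\t*\t*D:\t*",
    "*M4/4\t*\t*M4/4\t*\t*M4/4\t*\t*M4/4\t*",
    "*MM130\t*\t*MM130\t*\t*MM130\t*\t*MM130\t*",
])

Staff = Dict[str, Optional[Union[str, int]]]


def parse_staves(header: str) -> List[Staff]:
    """Return one metadata dictionary per **kern spine, in left-to-right order."""
    rows = [line.split("\t") for line in header.splitlines() if line.strip()]
    staves: List[Staff] = []
    for col, spine in enumerate(rows[0]):
        if spine != "**kern":
            continue  # skip **dynam and any other non-note spines
        info: Staff = {"instrument": None, "clef": None, "key_signature": None,
                       "key": None, "meter": None, "tempo": None}
        for row in rows[1:]:
            tok = row[col] if col < len(row) else "*"
            if tok.startswith("*I"):
                info["instrument"] = tok[2:]
            elif tok.startswith("*clef"):
                info["clef"] = tok[5:]
            elif tok.startswith("*k["):
                info["key_signature"] = tok[3:-1]
            elif tok.startswith("*MM"):      # tempo must be tested before meter
                info["tempo"] = int(tok[3:])
            elif tok.startswith("*M"):
                info["meter"] = tok[2:]
            elif tok.endswith(":"):
                info["key"] = tok[1:-1]
        staves.append(info)
    return staves
```

For the header above, this yields four staves, the first being a cello part in F4 clef, D major, 4/4, at 130 beats per minute.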
Output File Format
Each submitted prediction must be a valid **kern file.
The score must encode, at minimum:
- Staves musical metadata, including meter, clef, and key signature when required
- Pitches, including octaves and accidentals
- Note durations
- Bar lines and metrical structure
- Voice or instrument organization when applicable
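Before submitting, it can help to run a quick structural check on each predicted file. The sketch below is not a full **kern validator and is not part of the official evaluation. It only tests a few basic properties and assumes the file contains no spine splits or joins, so that every record has the same number of tab-separated columns.

```python
from typing import List


def kern_sanity_problems(text: str) -> List[str]:
    """Return a list of structural problems found in a kern string (empty if none)."""
    # Global comments and reference records start with "!!" and have no spine structure.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("!!")]
    if not lines:
        return ["file is empty"]
    problems = []
    header = lines[0].split("\t")
    if not all(tok.startswith("**") for tok in header):
        problems.append("first record is not an exclusive interpretation line")
    if "**kern" not in header:
        problems.append("no **kern spine declared")
    if any(tok != "*-" for tok in lines[-1].split("\t")):
        problems.append("spines are not terminated with *-")
    if not any(ln.startswith("=") for ln in lines):
        problems.append("no barline found")
    for number, ln in enumerate(lines, start=1):
        if len(ln.split("\t")) != len(header):
            problems.append(f"record {number} has a different number of spines")
    return problems
```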
For more details, participants may refer to:
We chose the **kern format for compatibility with existing SOTA work in A2S. Participants may use other formats during training or as intermediate outputs, but the final files should be in **kern format for evaluation.
Please make sure the proposed algorithm includes a conversion step if one is required.
Code Submissions
Participants may provide a Docker container or access to a code repository with clear environment setup instructions and a script that reads the input audio and writes output kern files to the requested destination directory.
The submitted script should be documented in the README and should not require manual intervention during evaluation.
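A possible shape for such a script is sketched below. It accepts either a single audio file or a folder and writes one output file per input without any manual step. The `transcribe` function is a placeholder for the participant's model, and the `.krn` extension, the accepted audio extensions, and the argument names are assumptions rather than requirements stated on this page.

```python
import argparse
from pathlib import Path
from typing import List, Optional

AUDIO_EXTENSIONS = {".flac", ".wav"}  # assumed set of accepted inputs


def transcribe(audio_path: Path) -> str:
    """Placeholder for the model: must return the predicted score as a **kern string."""
    return "**kern\n*-\n"


def collect_audio(input_path: Path) -> List[Path]:
    """Accept a single audio file or a folder containing audio files."""
    if input_path.is_file():
        return [input_path]
    return sorted(p for p in input_path.iterdir()
                  if p.suffix.lower() in AUDIO_EXTENSIONS)


def run(input_path: Path, output_dir: Path) -> List[Path]:
    """Write one kern file per input audio file and return the written paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for audio in collect_audio(input_path):
        out_path = output_dir / (audio.stem + ".krn")
        out_path.write_text(transcribe(audio), encoding="utf-8")
        written.append(out_path)
    return written


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Audio-to-score inference")
    parser.add_argument("input", type=Path, help="audio file or folder of audio files")
    parser.add_argument("output_dir", type=Path, help="directory for predicted kern files")
    args = parser.parse_args(argv)
    run(args.input, args.output_dir)
```

Calling `main()` from the script's entry point would allow an invocation such as `python run.py audio_dir output_dir`, where `run.py` is a hypothetical file name.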
Fallback: Pre-computed Kern Submission
Participants may alternatively submit the final kern files directly, similarly to the option proposed for MIREX 2024 Polyphonic Transcription.
This fallback is intended to allow participation from systems that cannot be evaluated directly by the Task Captain because of resource, licensing, or infrastructure constraints. Such submissions will be clearly flagged in the results page.
Training Datasets
Participants are free to use the training and validation sets of the datasets listed below. Data augmentation is allowed.
Additional training data may also be used, as long as it does not overlap with the evaluation sets and is clearly documented in the submission.
Quartets Dataset
The Quartets dataset consists of synthetic audio renderings of Haydn, Mozart, and Beethoven string quartets, together with their full kern scores [6].
The scores are taken from the humdrum-data repository. The kern files were split into 3 to 6 measure excerpts and synthesized into audio from performance MIDI files.
The final dataset contains approximately 20 hours of audio across 38,051 excerpts from 3 composers:
- 18,162 excerpts for Haydn
- 7,435 excerpts for Mozart
- 12,454 excerpts for Beethoven
The train, validation, and test splits used in [1] are publicly available:
- Quartets dataset on Hugging Face: https://huggingface.co/datasets/PRAIG/quartets-quartets
Evaluation Datasets
The datasets listed below are reserved for evaluation purposes and must not be used for training models.
We also ask participants not to check the performance of their algorithms on these sets before submission, since doing so would effectively turn them into validation sets and could lead to data leakage.
Quartets Test Set
Evaluation will include the test split of the Quartets dataset. Results on this dataset will be reported separately.
Undisclosed Evaluation Set
An additional custom evaluation set will be used to ensure fairness and to measure performance on out-of-distribution data.
General details on this dataset are: TBD
Time and Hardware Limits
Due to the potentially high number of participants in this and other audio tasks, hard limits on runtime and hardware use will be imposed.
Submissions that require more than 32 GB of VRAM, or more than 24 hours to process the test sets on a single GPU (V100 or similar), cannot be evaluated directly by the Task Captain. Participants in this situation may use the fallback kern submission format described above.
Questions?
- Contact Alexandre D'Hooge (Alex, he/him): dhooge[at]gbu[dot]edu[dot]cn
Bibliography
[1] Alfaro-Contreras, M., et al. (2024). A Transformer Approach for Polyphonic Audio-to-Score Transcription. Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Benetos, E., et al. (2019). Automatic Music Transcription: An Overview. IEEE Signal Processing Magazine, 36(1).
[3] Liu, L., et al. (2021). Joint Multi-Pitch Detection and Score Transcription for Polyphonic Piano Music. Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[4] McLeod, A., et al. (2018). Evaluating Automatic Polyphonic Music Transcription. Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR).
[5] Roman, M. A., et al. (2018). An End-to-End Framework for Audio-to-Score Music Transcription on Monophonic Excerpts. Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR).
[6] Roman, M. A., et al. (2019). A Holistic Approach to Polyphonic Music Transcription with Neural Networks. Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR).
[7] Smaragdis, P., et al. (2003). Non-Negative Matrix Factorization for Polyphonic Music Transcription. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.