2026:Audio-to-Score Transcription

From MIREX Wiki
Revision as of 01:14, 13 June 2026 by Alexandre DHooge (talk | contribs) (Create draft wiki page)

Description

This page describes the MIREX 2026 Audio-to-Score Transcription challenge. The evaluation procedure and the submission format are detailed in the sections below.

Automatic Music Transcription (AMT) is the task of designing computational algorithms that convert acoustic music signals into a symbolic musical representation [2]. Existing MIREX tasks have evaluated several important AMT components, including melody extraction, drum detection, piano transcription, and multi-instrument transcription. However, these tasks have generally stopped at MIDI or note-event outputs, which are not equivalent to a well-formed musical score.

The Audio-to-Score Transcription (A2S) challenge focuses on this missing step. Participating systems will receive an audio recording of a polyphonic music piece and must output a digital symbolic score that can be read by a musician or notation program. The target output is a kern score encoding standard musical information such as pitches, durations, accidentals, staves, meter, clef, key signature, and bar structure.

The task can be informally expressed as follows:

 Ŝ = argmax_S P(S | X)

where S ranges over candidate symbolic scores, X is the input audio signal, and Ŝ is the predicted score.

The challenge will include polyphonic recordings and may include multiple instruments. Pieces may be full recordings or shorter excerpts, such as 3 to 6 bars.

Two tracks will be considered:

  • Staves-Informed A2S: staves metadata is provided and may be used to guide voice identification and instrument recognition.
  • Blind A2S: only the audio recording is provided.

Participants may enter both tracks. Results will be reported separately.

Notice: We particularly encourage submissions that move beyond isolated AMT components and produce complete, valid symbolic scores.

Evaluation

Submissions will be evaluated with metrics that assess both the textual quality of the produced kern files and the musical transcription quality of the corresponding scores.

Score Quality

Character Error Rate (CER): the Levenshtein distance between the submitted and ground-truth kern files at the character level. This metric penalizes character insertions, deletions, and substitutions, and is sensitive to details such as pitch spelling and note durations.

Word Error Rate (WER): the Levenshtein distance computed on tab- and newline-separated strings in a kern file. WER is reported in recent A2S papers [1], [3]. Compared to CER, it evaluates the output closer to the note-token level, checking whether durations and pitches are jointly correct.

Line Error Rate (LER): the Levenshtein distance computed on complete lines of the output kern files. LER is stricter than CER and WER, and helps verify whether multiple instruments or voices are correctly transcribed and aligned at the beat and sub-beat level.
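
The three metrics above differ only in the unit on which the edit distance is computed. The sketch below illustrates this with a single Levenshtein routine applied to characters, tab- and newline-separated tokens, and whole lines. The function names are illustrative, and normalizing by the reference length to obtain a percentage is an assumption; the official evaluation pipeline may tokenize or normalize differently.

```python
import re
from typing import List, Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def split_units(kern: str, level: str) -> List[str]:
    """Split a kern file into characters, tokens, or lines."""
    if level == "char":
        return list(kern)
    if level == "word":
        # Tokens are separated by tabs and newlines only.
        return [tok for tok in re.split(r"[\t\n]", kern) if tok]
    if level == "line":
        return [line for line in kern.split("\n") if line]
    raise ValueError(f"unknown level: {level}")


def error_rate(ref: str, hyp: str, level: str) -> float:
    """Edit distance as a percentage of the reference length (assumed normalization)."""
    ref_units = split_units(ref, level)
    hyp_units = split_units(hyp, level)
    return 100.0 * levenshtein(ref_units, hyp_units) / max(len(ref_units), 1)


# Toy example: the hypothesis adds a spurious sharp to one note.
reference = "**kern\t**kern\n4c\t4e\n4d\t4f\n*-\t*-\n"
hypothesis = "**kern\t**kern\n4c\t4e\n4d\t4f#\n*-\t*-\n"

print(error_rate(reference, hypothesis, "char"))  # 3.125 (1 of 32 characters)
print(error_rate(reference, hypothesis, "word"))  # 12.5  (1 of 8 tokens)
print(error_rate(reference, hypothesis, "line"))  # 25.0  (1 of 4 lines)
```

A single wrong accidental costs little at the character level but invalidates a whole token and a whole line, which is why the three rates are reported together.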

Musical Transcription Quality

In addition to CER, WER, and LER, this challenge will use the MV2H metric introduced in [4] and used in recent A2S work [1], [3]. MV2H is an F-score obtained as the arithmetic mean of five sub-metrics:

  • Multi-pitch detection: pitches must be correct and detected at onsets within a 50 ms error threshold.
  • Voice separation: notes belonging to the same voice or instrument should be grouped together.
  • Metrical alignment: bars, beats, and sub-beats should be correctly identified.
  • Note value detection: note durations should be correctly transcribed.
  • Harmonic analysis: average of a key detection score and a chord-symbol recall.

The harmonic analysis component will not be used for the initial edition because no chord information is available in the considered datasets.
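
As a minimal sketch of the aggregation described above, the snippet below averages the sub-metric scores while leaving out the harmonic component. The function name, the dictionary keys, and the choice to average the four remaining components with equal weight are assumptions made for illustration; the actual computation is defined by the MV2H implementations and the evaluation pipeline.

```python
from statistics import mean
from typing import Dict, Iterable


def overall_mv2h(components: Dict[str, float],
                 excluded: Iterable[str] = ("harmony",)) -> float:
    """Arithmetic mean of the sub-metric scores that are not excluded."""
    skip = set(excluded)
    kept = [score for name, score in components.items() if name not in skip]
    if not kept:
        raise ValueError("no component left to average")
    return mean(kept)


# Hypothetical sub-metric scores for one transcription.
scores = {
    "multi_pitch": 0.75,
    "voice": 0.5,
    "meter": 1.0,
    "note_value": 0.25,
    "harmony": 0.0,  # not evaluated: no chord annotations in the datasets
}

print(overall_mv2h(scores))               # 0.625 (mean of four components)
print(overall_mv2h(scores, excluded=()))  # 0.5   (mean of all five components)
```

Excluding the unused component, rather than counting it as zero, keeps the overall score from being lowered by information that the datasets do not contain.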

Public implementations of MV2H are available:

The evaluation pipeline will adapt these implementations to process kern files, building where appropriate on the public code from [1]:

WER and MV2H will serve as the main ranking criteria. CER and LER will be reported for more detailed comparisons.


In addition to these performance metrics, each submission will be evaluated in terms of memory use, number of operations, and computational time required to process the evaluation set.

Submission Format

Several submission formats will be accepted to accommodate different system designs. If you encounter any issue with the submission process, please contact the Task Captain.

General Guidelines

All submissions should be "plug-and-play", with a clear README detailing usage steps.

The recommended submission format is a Docker image or a code repository with a main bash or Python script to run.

Resources Declaration: All submissions must state:

  • The training data size
  • The number of parameters in the model, if applicable
  • The number of GPU/CPU hours used for training, with device information (model and VRAM)
  • The inference time required for the evaluation set

We strongly recommend sharing the submitted algorithms or checkpoints under permissive open licenses. Any license, or no public release at all, is nevertheless accepted as long as this is clearly reported at the time of submission.

I / O

The submitted algorithm must take as input an audio file, or a folder containing audio files, and write one predicted **kern file for each input audio file to a specified output directory.
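
The following is a minimal sketch of an entry-point script that satisfies this contract. Everything in it is illustrative: `transcribe` is a hypothetical placeholder for the participant's model, the list of audio extensions is an assumption (the input format is still to be announced), and the `.krn` output extension and argument names are choices made for the example, not requirements.

```python
import argparse
from pathlib import Path
from typing import Callable, List, Optional

# Assumption: the accepted audio formats have not been announced yet.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def transcribe(audio_path: Path) -> str:
    """Hypothetical placeholder: run the model and return the score as kern text."""
    raise NotImplementedError("plug the transcription model in here")


def collect_audio(input_path: Path) -> List[Path]:
    """Accept either a single audio file or a folder of audio files."""
    if input_path.is_file():
        return [input_path]
    return sorted(p for p in input_path.iterdir()
                  if p.suffix.lower() in AUDIO_EXTENSIONS)


def run(input_path: Path, output_dir: Path,
        transcriber: Callable[[Path], str] = transcribe) -> List[Path]:
    """Write one kern file per input audio file into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for audio in collect_audio(input_path):
        target = output_dir / (audio.stem + ".krn")
        target.write_text(transcriber(audio), encoding="utf-8")
        written.append(target)
    return written


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Audio-to-score inference")
    parser.add_argument("input", type=Path, help="audio file or folder of audio files")
    parser.add_argument("output_dir", type=Path, help="destination for kern files")
    args = parser.parse_args(argv)
    for path in run(args.input, args.output_dir):
        print(path)

# In a real script, call main() under the usual `if __name__ == "__main__":` guard.
```

Naming each output after the stem of its input file lets the evaluation match predictions to references without any extra index file.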

Input Audio

Participating algorithms will receive audio files in the following format:

TBD


For the Blind A2S track, only the audio recording will be provided.

For the Staves-Informed A2S track, staves metadata will also be provided. This metadata may include meter, clef, key signature, and other score-structure information needed to guide transcription. The exact metadata packaging will be specified before submissions open.

Output File Format

Each submitted prediction must be a valid **kern file.

The score must encode, at minimum:

  • Staves musical metadata, including meter, clef, and key signature when required
  • Pitches, including octaves and accidentals
  • Note durations
  • Bar lines and metrical structure
  • Voice or instrument organization when applicable
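
To make these requirements concrete, the sketch below builds a one-bar, two-staff kern excerpt (G major, 4/4, bass and treble clefs) and runs a few basic checks on it. The excerpt and the `basic_kern_check` helper are illustrations only: the checks cover the opening and closing interpretations and the presence of clef, key signature, and meter, and are far from a full validity test.

```python
import re
from typing import List

# One bar for two staves; the first spine is the lower staff.
ROWS = [
    ("**kern", "**kern"),    # exclusive interpretation opening each spine
    ("*clefF4", "*clefG2"),  # bass clef, treble clef
    ("*k[f#]", "*k[f#]"),    # key signature: one sharp
    ("*M4/4", "*M4/4"),      # meter
    ("=1", "=1"),            # bar line
    ("2G", "4b"),            # half note G3 against quarter note B4
    (".", "4a"),             # null token: the half note is still sounding
    ("2D", "2g"),            # half notes D3 and G4
    ("==", "=="),            # final bar line
    ("*-", "*-"),            # spine terminators
]
EXAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"


def basic_kern_check(text: str) -> List[str]:
    """Return a list of problems found by a few shallow checks (not a full validator)."""
    rows = [line.split("\t") for line in text.splitlines()
            if line and not line.startswith("!")]  # skip comment lines
    if not rows:
        return ["no data lines"]
    problems = []
    if not all(tok.startswith("**") for tok in rows[0]):
        problems.append("first line must open every spine with an exclusive interpretation")
    if not all(tok == "*-" for tok in rows[-1]):
        problems.append("last line must terminate every spine with *-")
    tokens = [tok for row in rows for tok in row]
    if not any(tok.startswith("*clef") for tok in tokens):
        problems.append("no clef found")
    if not any(tok.startswith("*k[") for tok in tokens):
        problems.append("no key signature found")
    if not any(re.fullmatch(r"\*M\d+/\d+", tok) for tok in tokens):
        problems.append("no meter found")
    return problems


print(basic_kern_check(EXAMPLE))        # []
print(basic_kern_check("**kern\n4c\n"))  # four problems: terminator, clef, key, meter
```

A check of this kind can catch truncated or malformed model outputs before submission; parsing the file with a Humdrum-aware tool remains the more reliable test.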

For more details, participants may refer to:

We chose the **kern format for compatibility with existing state-of-the-art work in A2S. Participants may use other formats during training or as intermediate outputs, but the final files must be in **kern format for evaluation. Please make sure that the submitted algorithm includes a conversion step if one is required.

Code Submissions

Participants may provide a Docker container or access to a code repository with clear environment setup instructions and a script that reads the input audio and writes output kern files to the requested destination directory.

The submitted script should be documented in the README and should not require manual intervention during evaluation.

Fallback: Pre-computed Kern Submission

Alternatively, participants may submit the final kern files directly, similar to the option offered for MIREX 2024 Polyphonic Transcription.

This fallback is intended to allow participation from systems that cannot be evaluated directly by the Task Captain because of resource, licensing, or infrastructure constraints. Such submissions will be clearly flagged on the results page.

Training Datasets

Participants are free to use the training and validation sets of the datasets listed below. Data augmentation is allowed.

Additional training data may also be used, as long as it does not overlap with the evaluation sets and is clearly documented in the submission.

Quartets Dataset

The Quartets dataset consists of synthetic audio renderings of Haydn, Mozart, and Beethoven string quartets, together with their full kern scores [6].

The scores are taken from the humdrum-data repository. The kern files were split into excerpts of 3 to 6 measures and synthesized into audio from performance MIDI files.

The final dataset contains approximately 20 hours of audio across 38,051 excerpts from 3 composers:

  • 18,162 excerpts for Haydn
  • 7,435 excerpts for Mozart
  • 12,454 excerpts for Beethoven

The train, validation, and test splits used in [1] are publicly available:

MuseSyn Dataset

The MuseSyn dataset contains 210 piano pieces with scores in MusicXML format and audio synthesized through four different piano models [3]. It amounts to almost 10 hours of audio recordings for each piano timbre.

The pieces cover a wide range of key signatures, time signatures, tempos, and polyphony levels. The dataset is available upon request for non-commercial research use:

For this challenge, the MusicXML files will be converted to kern format, with manual verification where needed, to unify the evaluation pipeline.

Evaluation Datasets

The datasets listed below are reserved for evaluation purposes and must not be used for training models.

We also ask participants to refrain from measuring the performance of their algorithms on these sets before submission, as doing so would turn them into validation sets and could lead to data leakage.

Quartets Test Set

Evaluation will include the test split of the Quartets dataset. Results on this dataset will be reported separately.

MuseSyn Test Set

Evaluation will include the test split of the MuseSyn dataset after conversion of the reference scores to kern format. Results on this dataset will be reported separately.

Undisclosed Evaluation Set

An additional custom evaluation set will be used to ensure fairness and to assess performance on out-of-distribution data.

General details on this dataset are: TBD

Time and Hardware Limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on runtime and hardware use will be imposed.

Submissions that require more than 32 GB of VRAM, or more than 24 hours to process the test sets on a single GPU (V100 or similar), cannot be evaluated directly by the Task Captain. Participants in this situation may use the fallback kern submission format described above.

Questions?

  • Contact Alexandre D'Hooge (Alex, he/him): dhooge[at]gbu[dot]edu[dot]cn

Bibliography

[1] Alfaro-Contreras, M., et al. (2024). A Transformer Approach for Polyphonic Audio-to-Score Transcription. Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Benetos, E., et al. (2019). Automatic Music Transcription: An Overview. IEEE Signal Processing Magazine, 36(1).

[3] Liu, L., et al. (2021). Joint Multi-Pitch Detection and Score Transcription for Polyphonic Piano Music. Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] McLeod, A., et al. (2018). Evaluating Automatic Polyphonic Music Transcription. Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR).

[5] Roman, M. A., et al. (2018). An End-to-End Framework for Audio-to-Score Music Transcription on Monophonic Excerpts. Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR).

[6] Roman, M. A., et al. (2019). A Holistic Approach to Polyphonic Music Transcription with Neural Networks. Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR).

[7] Smaragdis, P., et al. (2003). Non-Negative Matrix Factorization for Polyphonic Music Transcription. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.