Difference between revisions of "2026:Music Performance Difficulty Prediction"
Revision as of 18:06, 30 May 2026
Music Performance Difficulty Prediction 2026
Welcome to the official MIREX wiki page for the Music Performance Difficulty Prediction task, new for MIREX 2026. This task targets the automatic estimation, from audio recordings, of how technically and musically demanding a piece is to perform, supporting applications in music education, library cataloguing, and pedagogical recommendation systems.
Overview
Music performance difficulty prediction has emerged as an active MIR sub-field over the past five years, with foundational datasets (Mikrokosmos-difficulty, CIPI, PSyllabus) and methods spanning symbolic, image-based, and audio-based modalities (Ramoneda et al., 2022–2025). Among these directions, audio-based difficulty estimation remains the most recent and least explored. Unlike symbolic or score-based approaches, audio-only methods do not require machine-readable scores, substantially widening the applicability of difficulty-prediction systems for music libraries, educational platforms, teachers, students, and recommendation systems.
Despite this growing interest, the field still lacks a standardized community benchmark for audio-based difficulty prediction. Existing studies rely on heterogeneous datasets, incompatible grading systems, and non-comparable evaluation protocols. This shared task establishes the first community-wide benchmark for piano difficulty estimation from audio.
Task Description
Participants submit systems that take a solo piano audio recording as input and output a predicted difficulty score.
- Input format: WAV, 44.1 kHz, mono or stereo.
- Output format: a real-valued difficulty score per recording. Only the ordering induced by the scores is evaluated; their absolute scale, range, and number of categories do not need to match the (hidden) evaluation scale.
- Two recordings per piece: each test piece is represented by at least two recordings — one human performance and one synthesized score rendering from a fixed soundfont — scored independently and aggregated per piece. This is designed so that systems cannot game the task by relying purely on performance-quality or recording-condition cues.
There is no inference-time limit per piece, but the full test set must be processed within a 24-hour wall-clock budget on a single GPU.
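The per-piece aggregation mentioned above could look like the following minimal sketch. The aggregation function is not specified on this page, so the arithmetic mean used here is an assumption, as are the function name `aggregate_per_piece` and the `(piece_id, score)` input layout.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_piece(recording_scores):
    """Collapse per-recording scores into one score per piece.

    recording_scores: iterable of (piece_id, score) pairs, one pair per
    recording. The mean is an ASSUMED aggregation rule, not the official one.
    """
    by_piece = defaultdict(list)
    for piece_id, score in recording_scores:
        by_piece[piece_id].append(float(score))
    return {piece_id: mean(scores) for piece_id, scores in by_piece.items()}


# Hypothetical scores: each piece has a human and a synthesized recording.
scores = [
    ("piece_a", 3.0),  # human performance
    ("piece_a", 5.0),  # synthesized rendering
    ("piece_b", 7.5),  # human performance
    ("piece_b", 6.5),  # synthesized rendering
]
per_piece = aggregate_per_piece(scores)
print(per_piece)
```

Averaging keeps a system from benefiting when only one of the two renderings happens to be scored well, which matches the stated motivation, but other rules (for example the median) would be equally plausible.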
Significance of the Task
This shared task establishes the first community-wide benchmark for piano difficulty estimation from audio, featuring:
- A consistent ordinal evaluation framework
- A fully private held-out test set curated specifically for this task
- Standardized ordinal regression and ranking metrics
- Composer-disjoint evaluation splits designed to measure true generalization beyond repertoire memorization
- A hidden pedagogical difficulty scale validated by expert pedagogues and musicologists
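For local validation, a composer-disjoint split in the spirit of the list above can be approximated as follows. This is only a sketch: the `composer` field, the function name, and the choice to hold out a fraction of composers (rather than a fraction of recordings) are assumptions and do not describe how the organizers build the official split.

```python
import random


def composer_disjoint_split(items, val_fraction=0.2, seed=0):
    """Split items so that no composer appears in both train and validation.

    items: list of dicts with a 'composer' key (hypothetical metadata layout).
    val_fraction applies to the number of composers, not the number of items.
    """
    composers = sorted({item["composer"] for item in items})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(composers)
    n_val = max(1, round(len(composers) * val_fraction))
    val_composers = set(composers[:n_val])
    train = [it for it in items if it["composer"] not in val_composers]
    val = [it for it in items if it["composer"] in val_composers]
    return train, val


# Hypothetical metadata: five composers, two recordings each.
items = [
    {"composer": f"composer_{c}", "recording": f"rec_{c}_{k}"}
    for c in "abcde"
    for k in range(2)
]
train, val = composer_disjoint_split(items)
print(len(train), len(val))
```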
Beyond benchmarking, the task aims to lower the entry barrier for researchers working on audio representation learning, multimodal learning, and music understanding, while fostering a growing research community at the intersection of MIR, pedagogy, and computational music education.
Evaluation Criteria
During development, participants are encouraged to evaluate their systems on public datasets such as PSyllabus using standard ordinal-regression metrics, including:
- Mean Squared Error (MSE)
- Accuracy within one level (Acc±1)
- Balanced accuracy
- Spearman's ρ
These metrics are useful for model selection and comparison on publicly annotated datasets with explicit grade structures.
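The development metrics listed above can be computed with a few lines of standard-library Python. The sketch below assumes integer grade labels for both references and predictions (as in a dataset with explicit grade levels), defines balanced accuracy as the mean per-class recall, and computes Spearman's ρ as the Pearson correlation of tie-averaged ranks. The function names are illustrative.

```python
from collections import defaultdict


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def acc_within_one(y_true, y_pred):
    # Fraction of predictions at most one grade away from the reference.
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall over the classes present in y_true.
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    # Pearson correlation of the rank vectors (undefined if either is constant).
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical grades for six validation recordings.
y_true = [1, 2, 3, 4, 5, 5]
y_pred = [1, 3, 3, 6, 5, 4]
print(mse(y_true, y_pred), acc_within_one(y_true, y_pred),
      balanced_accuracy(y_true, y_pred), spearman_rho(y_true, y_pred))
```

Systems that output real-valued scores would need a mapping to grade levels before MSE, Acc±1, and balanced accuracy are meaningful; Spearman's ρ can be applied to the raw scores directly.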
However, because the official evaluation set uses a hidden ordinal pedagogical scale whose granularity and number of levels are not disclosed, the shared-task ranking is based exclusively on Kendall's Tau-c, which measures agreement between predicted and reference orderings without assuming fixed distances between categories.
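As a reference point for the ranking metric, the sketch below implements the standard Stuart formulation of Kendall's Tau-c, τ_c = 2(C − D) / (n² (m − 1)/m), where C and D are the numbers of concordant and discordant pairs, n is the number of items, and m is the smaller of the two counts of distinct values. The organizers' exact implementation is not given on this page, so treat this as an illustration of the statistic rather than the official scorer.

```python
def kendall_tau_c(reference, predicted):
    """Stuart's Kendall Tau-c between reference levels and predicted scores."""
    n = len(reference)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (reference[i] - reference[j]) * (predicted[i] - predicted[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # Pairs tied in either sequence count as neither.
    m = min(len(set(reference)), len(set(predicted)))
    if m < 2:
        return float("nan")  # undefined when one sequence is constant
    return 2 * (concordant - discordant) / (n ** 2 * (m - 1) / m)


# Hypothetical example: three hidden levels, real-valued system scores.
reference = [1, 1, 2, 2, 3, 3]
predicted = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
tau = kendall_tau_c(reference, predicted)
print(round(tau, 4))
```

Because only the signs of pairwise differences enter the statistic, any strictly increasing rescaling of the predicted scores leaves the result unchanged, which is why the output scale of a system does not matter.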
In addition to the official ranking metric, organizers may report supplementary analyses such as per-composer performance to better understand cross-repertoire generalization and robustness across stylistic domains.
Training Datasets
Participants are welcome to train their systems on any dataset, including publicly available corpora, proprietary collections, or internally curated material. Apart from the exclusion of held-out evaluation repertoire described below, there are no restrictions on dataset origin, but we require full transparency in the technical report.
Suggested datasets for training and validation:
- PSyllabus (https://zenodo.org/records/12783403) (Ramoneda et al., 2025): 7,901 audio recordings of classical piano pieces with 11-level pedagogical grades, together with mappings to multiple international grading systems. We specifically encourage participants to leverage the multiple ranking annotations available in PSyllabus (13 grading/ranking systems in total), not only the default labels, in order to study and improve generalization across heterogeneous pedagogical traditions and difficulty scales.
- CIPI (Can I Play It?) (https://zenodo.org/records/8037327) (Ramoneda et al., 2024): 652 MusicXML piano pieces with 9-level Henle annotations, useful for participants interested in transcription-then-classification pipelines or distant supervision approaches.
- Mikrokosmos-difficulty (https://github.com/PRamoneda/Mikrokosmos-difficulty) (Ramoneda et al., 2022) and other publicly available piano difficulty datasets and pedagogical collections.
Participants may use any additional external data, provided that no part of the held-out evaluation repertoire is used directly or indirectly for training. To minimize contamination risks, a list of forbidden test pieces and composers will be published together with the task call.
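One way to honour the exclusion rule above is to filter training metadata against the published list before training. The format of that list is not yet known, so everything in this sketch is hypothetical: the `composer` and `title` fields, the shape of the forbidden lists, and the simple accent- and case-insensitive matching, which will not catch alternative titles or transliterations.

```python
import unicodedata


def normalize(text):
    """Lower-case, strip accents, and collapse whitespace for loose matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def filter_forbidden(items, forbidden_composers, forbidden_pieces):
    """Drop items whose composer, or (composer, title) pair, is forbidden.

    items: list of dicts with 'composer' and 'title' keys (assumed layout).
    forbidden_pieces: iterable of (composer, title) pairs (assumed layout).
    """
    bad_composers = {normalize(c) for c in forbidden_composers}
    bad_pieces = {(normalize(c), normalize(t)) for c, t in forbidden_pieces}
    kept = []
    for item in items:
        composer = normalize(item["composer"])
        title = normalize(item["title"])
        if composer in bad_composers or (composer, title) in bad_pieces:
            continue
        kept.append(item)
    return kept


# Entirely made-up metadata and forbidden lists, for illustration only.
training_items = [
    {"composer": "Émile Exemple", "title": "Étude No. 1"},
    {"composer": "Ana Sample", "title": "Nocturne in C"},
    {"composer": "Ana Sample", "title": "Waltz in  G"},
]
kept = filter_forbidden(
    training_items,
    forbidden_composers=["emile exemple"],
    forbidden_pieces=[("Ana Sample", "waltz in g")],
)
print([item["title"] for item in kept])
```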
Please describe in your technical report:
- Dataset names and sources
- Size and number of pieces
- Any preprocessing, cleaning, or label-space remapping applied
Held-Out Test Set
To address the central concerns of label trust and contamination, the held-out test set is constructed as a fully private benchmark curated specifically for this task. The dataset is assembled and annotated by expert piano pedagogues, with all annotations independently validated by an expert in pedagogy and musicology. Difficulty labels are assigned using an internal ordinal scale inspired by established pedagogical curricula and examination systems, but the exact mapping, granularity, and total number of levels are intentionally not disclosed to participants.
- The repertoire is selected from sources outside existing public piano difficulty datasets, including contemporary works, underrepresented composers, and recent pedagogical material not present in commonly used corpora. This design minimizes overlap with publicly available graded datasets and reduces the possibility of memorization or contamination effects.
- All annotations undergo independent expert review and adjudication before inclusion in the benchmark. Neither the raw labels nor the precise scale definition is released publicly.
- The held-out scores, audio recordings, and annotations remain private to the task organizers throughout and after the evaluation. Participants submit executable systems (Docker containers preferred), which are run by the organizers on the hidden evaluation set.
Because the underlying difficulty scale is ordinal and intentionally hidden, system performance is evaluated primarily through rank correlation rather than exact class prediction. The official ranking metric is Kendall's Tau-c, which measures agreement between predicted and reference difficulty orderings while remaining robust to unknown category spacing and differing numbers of ordinal levels.
Submission Requirements
The following items are required for submission:
- System: packaged as a Docker container with a standardised inference interface — input is a path to a WAV file; output is a single real-valued difficulty score. A reference Docker template and inference wrapper script will be provided.
- Technical report: 2–4 pages in ISMIR LBD format describing training data, model architecture, label-space handling, and any post-processing.
- Compute declaration: GPU memory footprint and average inference time per piece.
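Until the reference template is published, the inference interface described in the list above (a WAV path in, one real-valued score out) might be prototyped roughly as below. This is a sketch under assumptions: the script name `infer.py`, writing the score to standard output, and the six-decimal format are guesses, and `predict_difficulty` is a placeholder that returns the clip duration instead of running a real model.

```python
import os
import sys
import tempfile
import wave


def predict_difficulty(wav_path):
    """PLACEHOLDER predictor: returns the clip duration in seconds.

    A real submission would load its model here and return its score.
    """
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def main(argv):
    """Entry point: argv[1] is the WAV path; prints one score to stdout."""
    if len(argv) != 2:
        print("usage: infer.py <path-to-wav>", file=sys.stderr)
        return 2
    print(f"{predict_difficulty(argv[1]):.6f}")
    return 0


# Demo: write one second of 44.1 kHz mono silence and run the wrapper on it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(44100)
        wav.writeframes(b"\x00\x00" * 44100)
    exit_code = main(["infer.py", demo_path])
    demo_score = predict_difficulty(demo_path)
```

In a container, the same `main` would typically be called with `sys.argv` from the image's entry point; the official template may define a different contract.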
Final submission must be made through the MIREX submission system. Submissions exceeding the 24-hour total inference budget on the full test set, or failing on more than 5% of test items, are reported but excluded from the primary ranking.
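The two exclusion conditions above can be checked locally before submitting. The sketch below assumes that a "test item" is one recording and that the budget is measured as total wall-clock seconds over the full test set; both readings, and the function name, are assumptions.

```python
def primary_ranking_eligible(n_items, n_failed, total_seconds,
                             budget_hours=24, max_failure_percent=5):
    """Return True if a run satisfies both primary-ranking conditions.

    Excluded when the run exceeds the total inference budget or fails on
    MORE than max_failure_percent of items (exactly 5% is still eligible).
    """
    within_budget = total_seconds <= budget_hours * 3600
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    within_failures = n_failed * 100 <= max_failure_percent * n_items
    return within_budget and within_failures


# Hypothetical runs on a 200-item test set.
print(primary_ranking_eligible(200, 10, 3 * 3600))   # exactly 5% failures
print(primary_ranking_eligible(200, 11, 3 * 3600))   # more than 5% failures
print(primary_ranking_eligible(200, 0, 25 * 3600))   # over the 24-hour budget
```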
Timeline
(To be confirmed in line with the overall MIREX 2026 schedule.)
- TBD: Task call published, training-data mapping released
- TBD: Submission system opens
- TBD: Submission deadline
- TBD: Results announced at ISMIR 2026
Long-term Plan
We are committed to maintaining the task for at least three iterations. We will collaborate with the broader piano-difficulty research community to keep label-space conventions consistent across this task and parallel evaluation efforts. The hidden held-out test set is designed to grow incrementally across editions, with prior-year items optionally re-used subject to confirmation that participants have not gained access.
Given the field's relatively small size, we propose running this task annually for the first two years to bootstrap community engagement, and biennially thereafter. A symbolic-input track may be added in future editions if community interest warrants it.
Organizers
- Huan Zhang (Queen Mary University of London) — huan.zhang@qmul.ac.uk
- Pedro Ramoneda (Songscription) — pedro@songscription.ai
References
- Ramoneda, P., Jeong, D., Eremenko, V., Tamer, N. C., Miron, M., & Serra, X. (2024). Combining piano performance dimensions for score difficulty classification. Expert Systems with Applications, 238, 1–16.
- Ramoneda, P., Lee, M., Jeong, D., Valero-Mas, J. J., & Serra, X. (2025). Can audio reveal music performance difficulty? Insights from the Piano Syllabus Dataset.
- Ramoneda, P., Tamer, N. C., Eremenko, V., Serra, X., & Miron, M. (2022). Score difficulty analysis for piano performance education based on fingering. In ICASSP 2022 – IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 201–205). IEEE.