Difference between revisions of "2026:Music Performance Difficulty Prediction"

From MIREX Wiki
(Created page with "= Music Performance Difficulty Prediction 2026 = Welcome to the official MIREX wiki page for the '''Music Performance Difficulty Prediction''' task, new for MIREX 2026. This...")
 
(Music Performance Difficulty Prediction 2026)
Line 5: Line 5:
 
== Overview ==
 
== Overview ==
  
Music performance difficulty prediction has emerged as an active MIR sub-field over the past five years, with foundational datasets (Mikrokosmos-difficulty, CIPI, PSyllabus) and methods spanning symbolic, image, and audio modalities ([https://arxiv.org/abs/2203.13010 Ramoneda et al., ICASSP 2022]; [https://arxiv.org/abs/2306.08480 2023]; [https://arxiv.org/abs/2309.16287 ISMIR 2023]; [https://arxiv.org/abs/2403.03947 TASLP 2024]; [https://arxiv.org/abs/2408.00473 ISMIR 2024]). Despite this growth, there has been no community-wide evaluation: published results use inconsistent splits, label scales, and metrics, making cross-method comparison difficult.
+
Music performance difficulty prediction has emerged as an active MIR sub-field over the past five years, with foundational datasets (Mikrokosmos-difficulty, CIPI, PSyllabus) and methods spanning symbolic, image-based, and audio-based modalities (Ramoneda et al., 2022–2025). Among these directions, audio-based difficulty estimation remains the most recent and least explored. Unlike symbolic or score-based approaches, audio-only methods do not require machine-readable scores, substantially widening the applicability of difficulty-prediction systems for music libraries, educational platforms, teachers, students, and recommendation systems.
  
This task establishes the first shared benchmark for audio-based music performance difficulty prediction, with a consistent ordinal label space, a held-out test set never publicly released, and standard ordinal regression metrics under composer-disjoint splits.
+
Despite this growing interest, the field still lacks a standardized community benchmark for audio-based difficulty prediction. Existing studies rely on heterogeneous datasets, incompatible grading systems, and non-comparable evaluation protocols. This shared task establishes the first community-wide benchmark for piano difficulty estimation from audio.
  
 
== Task Description ==
 
== Task Description ==
  
Participants submit systems that take a '''solo piano audio recording''' as input and output a predicted difficulty grade on an ordinal '''1–9 scale''', aligned to the Henle Verlag grading system.
+
Participants submit systems that take a '''solo piano audio recording''' as input and output a predicted difficulty score.
  
 
* '''Input format''': WAV, 44.1 kHz, mono or stereo.
 
* '''Input format''': WAV, 44.1 kHz, mono or stereo.
* '''Output format''': integer grade 1–9 together with a 9-dimensional confidence vector.
+
* '''Output format''': a real-valued difficulty score per recording. Only the ordering induced by the scores is evaluated, so their absolute scale, range, and number of distinct values need not match the (hidden) evaluation scale.
* '''Two recordings per piece''': each test piece is represented by at least two recordings — one human performance and one synthesized rendering — scored independently and aggregated per piece. This is designed so that systems cannot game the task by relying purely on performance-quality or recording-condition cues.
+
* '''Two recordings per piece''': each test piece is represented by at least two recordings — one human performance and one synthesized score rendering from a fixed soundfont — scored independently and aggregated per piece. This is designed so that systems cannot game the task by relying purely on performance-quality or recording-condition cues.
  
This year we deliberately focus on '''audio only''' to align with the broader applicability of audio-based difficulty prediction tools (no machine-readable score required) and to attract participants from the audio representation learning community. A symbolic-input track may be added in future editions.
+
There is no inference-time limit per piece, but the full test set must be processed within a 24-hour wall-clock budget on a single GPU.
 +
 
 +
== Significance of the Task ==
 +
 
 +
As the first community-wide benchmark for piano difficulty estimation from audio, the task features:
 +
 
 +
* A consistent ordinal evaluation framework
 +
* A fully private held-out test set curated specifically for this task
 +
* Standardized ordinal regression and ranking metrics
 +
* Composer-disjoint evaluation splits designed to measure true generalization beyond repertoire memorization
 +
* A hidden pedagogical difficulty scale validated by expert pedagogues and musicologists
 +
 
 +
Beyond benchmarking, the task aims to lower the entry barrier for researchers working on audio representation learning, multimodal learning, and music understanding, while fostering a growing research community at the intersection of MIR, pedagogy, and computational music education.
  
 
== Evaluation Criteria ==
 
== Evaluation Criteria ==
  
Submissions are ranked using standard ordinal regression metrics computed on the held-out test set:
+
During development, participants are encouraged to evaluate their systems on public datasets such as PSyllabus using standard ordinal-regression metrics, including:
  
* '''Mean Squared Error (MSE)''' — primary ranking metric
+
* Mean Squared Error (MSE)
* '''Accuracy within one level (Acc±1)''' — fraction of predictions within one grade of the true label
+
* Accuracy within one level (Acc±1)
* '''Balanced accuracy''' — per-class accuracy averaged across grade levels, to handle class imbalance
+
* Balanced accuracy
* '''Spearman ρ''' — rank correlation between predicted and ground-truth grades
+
* Spearman's ρ
  
The primary ranking uses MSE; ties are broken by Acc±1. We additionally report per-difficulty-level confusion matrices, per-composer error, and the gap between human-recording and synthesized-recording scores on the same piece (to surface systems that overfit to performance-quality cues).
+
These metrics are useful for model selection and comparison on publicly annotated datasets with explicit grade structures.
 +
 
 +
However, because the official evaluation set uses a '''hidden ordinal pedagogical scale''' whose granularity and number of levels are not disclosed, the shared-task ranking is based exclusively on '''Kendall's Tau-c''', which measures agreement between predicted and reference orderings without assuming fixed distances between categories.
 +
 
 +
In addition to the official ranking metric, organizers may report supplementary analyses such as per-composer performance to better understand cross-repertoire generalization and robustness across stylistic domains.
  
 
== Training Datasets ==
 
== Training Datasets ==
Line 36: Line 52:
 
Suggested datasets for training and validation:
 
Suggested datasets for training and validation:
  
* '''[https://zenodo.org/records/12783403 PSyllabus]''' (Ramoneda et al., 2024): 7,901 audio recordings of classical piano pieces with 11-level grades. A provided mapping to the Henle 1–9 label space will be released alongside the task.
+
* '''[https://zenodo.org/records/12783403 PSyllabus]''' (Ramoneda et al., 2025): 7,901 audio recordings of classical piano pieces with 11-level pedagogical grades, together with mappings to multiple international grading systems. We specifically encourage participants to leverage the multiple ranking annotations available in PSyllabus (13 grading/ranking systems in total), not only the default labels, in order to study and improve generalization across heterogeneous pedagogical traditions and difficulty scales.
* '''[https://zenodo.org/records/8037327 CIPI (Can I Play It?)]''' (Ramoneda et al., 2023): 652 MusicXML pieces with 9-level Henle labels. Useful for participants who wish to transcribe-then-classify or for distant supervision.
+
* '''[https://zenodo.org/records/8037327 CIPI (Can I Play It?)]''' (Ramoneda et al., 2024): 652 MusicXML piano pieces with 9-level Henle annotations, useful for participants interested in transcription-then-classification pipelines or distant supervision approaches.
* '''[https://github.com/PRamoneda/Mikrokosmos-difficulty Mikrokosmos-difficulty]''': 147 Bartók pieces with 3 difficulty levels.
+
* '''[https://github.com/PRamoneda/Mikrokosmos-difficulty Mikrokosmos-difficulty]''' (Ramoneda et al., 2022) and other publicly available piano difficulty datasets and pedagogical collections.
* Other publicly available difficulty collections (Pianostreet-difficulty, Freescore-difficulty, Hidden Voices) may also be used.
 
  
'''Important''': no part of the held-out test set may be used for training. A list of forbidden pieces will be published with the task call. Please describe in your technical report:
+
Participants may use any additional external data, provided that no part of the held-out evaluation repertoire is used directly or indirectly for training. To minimize contamination risks, a list of forbidden test pieces and composers will be published together with the task call.
* Dataset name and source
+
 
 +
Please describe in your technical report:
 +
* Dataset names and sources
 
* Size and number of pieces
 
* Size and number of pieces
 
* Any preprocessing, cleaning, or label-space remapping applied
 
* Any preprocessing, cleaning, or label-space remapping applied
Line 48: Line 65:
 
== Held-Out Test Set ==
 
== Held-Out Test Set ==
  
To ensure label trust and prevent training-set contamination, the held-out test set (target size '''≈ 200 pieces''') is constructed by the task captain in three stages and is '''not publicly released''':
+
To address the central concern of label trust and contamination, the held-out test set is constructed as a '''fully private benchmark''' curated specifically for this task. The dataset is assembled and annotated by expert piano pedagogues, with all annotations independently validated by an expert pedagogue and musicologist. Difficulty labels are assigned using an internal ordinal scale inspired by established pedagogical curricula and examination systems, but the exact mapping, granularity, and total number of levels are intentionally not disclosed to participants.
  
# '''Piece selection''': drawn from sources outside existing public difficulty datasets — (a) Henle Verlag catalogue additions published after the CIPI cutoff; (b) ABRSM, Trinity, and RCM 2024–2026 syllabus pieces not present in PSyllabus; (c) a curated set of contemporary and historically underrepresented composers (e.g. Hidden Voices) graded specifically for this task.
+
* The repertoire is selected from sources outside existing public piano difficulty datasets, including contemporary works, underrepresented composers, and recent pedagogical material not present in commonly used corpora. This design minimizes overlap with publicly available graded datasets and reduces the possibility of memorization or contamination effects.
# '''Grade verification''': each piece is graded independently by at least two expert pianists recruited from the QMUL music programme and the ABRSM examiner network on the Henle 1–9 scale. Inter-annotator agreement is reported alongside the final results. Disagreements >1 level are resolved by a third annotator.
+
* All annotations undergo independent expert review and adjudication before inclusion in the benchmark. Neither the raw labels nor the precise scale definition is released publicly.
# '''Audio rendering''': for each piece, at least two reference recordings are produced — (i) a human performance, either recorded on a Yamaha Disklavier at QMUL or sourced from licensed commercial recordings where redistribution of features is permitted; and (ii) a high-quality synthesized rendering from the verified score, generated with a fixed renderer.
+
* The held-out scores, audio recordings, and annotations remain private to the task organizers throughout and after the evaluation. Participants submit executable systems (Docker containers preferred), which are run by the organizers on the hidden evaluation set.
  
'''Test set governance''': the held-out audio files and labels remain private to the task captain. Participants submit systems; the task captain runs inference on the held-out set and reports metrics back. Test recordings and labels are not released after the competition, enabling re-use and incremental growth across editions.
+
Because the underlying difficulty scale is ordinal and intentionally hidden, system performance is evaluated primarily through rank correlation rather than exact class prediction. The official ranking metric is '''Kendall's Tau-c''', which measures agreement between predicted and reference difficulty orderings while remaining robust to unknown category spacing and differing numbers of ordinal levels.
  
 
== Submission Requirements ==
 
== Submission Requirements ==
Line 60: Line 77:
 
The following items are required for submission:
 
The following items are required for submission:
  
* '''System''': packaged as a Docker container with a standardised inference interface — input is a path to a WAV file; output is an integer grade 1–9 with a 9-dimensional confidence vector. A reference Docker template and inference wrapper script will be provided.
+
* '''System''': packaged as a Docker container with a standardised inference interface — input is a path to a WAV file; output is a single real-valued difficulty score. A reference Docker template and inference wrapper script will be provided.
 
* '''Technical report''': 2–4 pages in ISMIR LBD format describing training data, model architecture, label-space handling, and any post-processing.
 
* '''Technical report''': 2–4 pages in ISMIR LBD format describing training data, model architecture, label-space handling, and any post-processing.
 
* '''Compute declaration''': GPU memory footprint and average inference time per piece.
 
* '''Compute declaration''': GPU memory footprint and average inference time per piece.
Line 74: Line 91:
 
* '''TBD''': Submission deadline
 
* '''TBD''': Submission deadline
 
* '''TBD''': Results announced at ISMIR 2026
 
* '''TBD''': Results announced at ISMIR 2026
 +
 +
== Long-term Plan ==
 +
 +
We are committed to maintaining the task for at least three iterations. We will collaborate with the broader piano-difficulty research community to keep label-space conventions consistent across this task and parallel evaluation efforts. The hidden held-out test set is designed to grow incrementally across editions, with prior-year items optionally re-used subject to confirmation that participants have not gained access to them.
 +
 +
Given the field's relatively small size, we propose running this task annually for the first two years to bootstrap community engagement, and biennially thereafter. A symbolic-input track may be added in future editions if community interest warrants it.
  
 
== Organizers ==
 
== Organizers ==
  
* '''Huan Zhang''' (Task Captain, Queen Mary University of London)
+
* '''Huan Zhang''' (Queen Mary University of London) — [mailto:huan.zhang@qmul.ac.uk huan.zhang@qmul.ac.uk]
* '''Pedro Ramoneda''' (Universitat Pompeu Fabra)
+
* '''Pedro Ramoneda''' (Songscription) — [mailto:pedro@songscription.ai pedro@songscription.ai]
 +
 
 +
== References ==
  
Contact: [mailto:huan.zhang@qmul.ac.uk huan.zhang@qmul.ac.uk]
+
* Ramoneda, P., Jeong, D., Eremenko, V., Tamer, N. C., Miron, M., & Serra, X. (2024). Combining piano performance dimensions for score difficulty classification. ''Expert Systems with Applications'', 238, 1–16.
 +
* Ramoneda, P., Lee, M., Jeong, D., Valero-Mas, J. J., & Serra, X. (2025). Can audio reveal music performance difficulty? Insights from the Piano Syllabus Dataset.
 +
* Ramoneda, P., Tamer, N. C., Eremenko, V., Serra, X., & Miron, M. (2022). Score difficulty analysis for piano performance education based on fingering. In ''ICASSP 2022 – IEEE International Conference on Acoustics, Speech and Signal Processing'' (pp. 201–205). IEEE.

Revision as of 18:06, 30 May 2026

Music Performance Difficulty Prediction 2026

Welcome to the official MIREX wiki page for the Music Performance Difficulty Prediction task, new for MIREX 2026. This task targets the automatic estimation of how technically and musically demanding a piece is to perform from audio recordings, supporting applications in music education, library cataloguing, and pedagogical recommendation systems.

Overview

Music performance difficulty prediction has emerged as an active MIR sub-field over the past five years, with foundational datasets (Mikrokosmos-difficulty, CIPI, PSyllabus) and methods spanning symbolic, image-based, and audio-based modalities (Ramoneda et al., 2022–2025). Among these directions, audio-based difficulty estimation remains the most recent and least explored. Unlike symbolic or score-based approaches, audio-only methods do not require machine-readable scores, substantially widening the applicability of difficulty-prediction systems for music libraries, educational platforms, teachers, students, and recommendation systems.

Despite this growing interest, the field still lacks a standardized community benchmark for audio-based difficulty prediction. Existing studies rely on heterogeneous datasets, incompatible grading systems, and non-comparable evaluation protocols. This shared task establishes the first community-wide benchmark for piano difficulty estimation from audio.

Task Description

Participants submit systems that take a solo piano audio recording as input and output a predicted difficulty score.

  • Input format: WAV, 44.1 kHz, mono or stereo.
  • Output format: a real-valued difficulty score per recording. Only the ordering induced by the scores is evaluated, so their absolute scale, range, and number of distinct values need not match the (hidden) evaluation scale.
  • Two recordings per piece: each test piece is represented by at least two recordings — one human performance and one synthesized score rendering from a fixed soundfont — scored independently and aggregated per piece. This is designed so that systems cannot game the task by relying purely on performance-quality or recording-condition cues.

There is no inference-time limit per piece, but the full test set must be processed within a 24-hour wall-clock budget on a single GPU.
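
The page states that recordings are scored independently and then aggregated per piece, but it does not say how. The following minimal Python sketch assumes a simple arithmetic mean over the recordings of each piece; the function name, the `(piece_id, score)` input layout, and the choice of the mean are all illustrative assumptions, not the organizers' procedure.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_piece(recording_scores):
    """Average recording-level scores into one score per piece.

    recording_scores: iterable of (piece_id, score) pairs, one pair per
    recording. The mean is an assumed aggregation rule (hypothetical).
    """
    by_piece = defaultdict(list)
    for piece_id, score in recording_scores:
        by_piece[piece_id].append(float(score))
    return {piece_id: mean(scores) for piece_id, scores in by_piece.items()}


# Placeholder identifiers and scores, two recordings per piece.
scores = [
    ("piece_a", 2.0),  # human performance
    ("piece_a", 3.0),  # synthesized rendering
    ("piece_b", 7.5),  # human performance
    ("piece_b", 6.5),  # synthesized rendering
]
per_piece = aggregate_per_piece(scores)
```

Because only the ordering of the final scores matters, a system whose outputs differ strongly between the human and synthesized recordings of the same piece risks an unstable per-piece ranking under any such aggregation.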

Significance of the Task

As the first community-wide benchmark for piano difficulty estimation from audio, the task features:

  • A consistent ordinal evaluation framework
  • A fully private held-out test set curated specifically for this task
  • Standardized ordinal regression and ranking metrics
  • Composer-disjoint evaluation splits designed to measure true generalization beyond repertoire memorization
  • A hidden pedagogical difficulty scale validated by expert pedagogues and musicologists

Beyond benchmarking, the task aims to lower the entry barrier for researchers working on audio representation learning, multimodal learning, and music understanding, while fostering a growing research community at the intersection of MIR, pedagogy, and computational music education.

Evaluation Criteria

During development, participants are encouraged to evaluate their systems on public datasets such as PSyllabus using standard ordinal-regression metrics, including:

  • Mean Squared Error (MSE)
  • Accuracy within one level (Acc±1)
  • Balanced accuracy
  • Spearman's ρ

These metrics are useful for model selection and comparison on publicly annotated datasets with explicit grade structures.
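
For development on public data with explicit integer grades, the four metrics listed above can be computed with the standard library alone. The sketch below is one possible implementation, not an official scoring script: the function names are hypothetical, balanced accuracy is taken as the mean of per-class recall, Spearman's ρ is computed as the Pearson correlation of average ranks, and predictions are assumed to have already been rounded to integer grades.

```python
def mse(y_true, y_pred):
    """Mean squared error between true and predicted grades."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def acc_within_one(y_true, y_pred):
    """Fraction of predictions at most one grade away from the true grade."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the grades present in y_true."""
    recalls = []
    for grade in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == grade]
        recalls.append(sum(y_pred[i] == grade for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (vx * vy) ** 0.5


# Placeholder grades, for illustration only.
y_true = [1, 2, 3, 4, 5, 5]
y_pred = [1, 3, 3, 6, 5, 4]
results = {
    "mse": mse(y_true, y_pred),
    "acc_pm1": acc_within_one(y_true, y_pred),
    "balanced_acc": balanced_accuracy(y_true, y_pred),
    "spearman": spearman_rho(y_true, y_pred),
}
```

MSE, Acc±1, and balanced accuracy all depend on the label scale of the dataset at hand, so values obtained on different grading systems are not directly comparable.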

However, because the official evaluation set uses a hidden ordinal pedagogical scale whose granularity and number of levels are not disclosed, the shared-task ranking is based exclusively on Kendall's Tau-c, which measures agreement between predicted and reference orderings without assuming fixed distances between categories.
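
To make the ranking metric concrete, the sketch below implements Kendall's Tau-c (Stuart's variant) in plain Python as τ_c = 2(C − D) / (n² (m − 1) / m), where C and D are the numbers of concordant and discordant pairs, n is the number of items, and m is the smaller of the numbers of distinct values in the two sequences. The function name and the example values are illustrative assumptions; how the organizers compute the metric is not specified on this page. SciPy users can likely obtain the same quantity from `scipy.stats.kendalltau` with `variant="c"`.

```python
from itertools import combinations


def kendall_tau_c(pred, ref):
    """Kendall's Tau-c between predicted scores and reference levels."""
    n = len(pred)
    if n != len(ref):
        raise ValueError("pred and ref must have the same length")
    # m is the smaller number of distinct values across the two inputs.
    m = min(len(set(pred)), len(set(ref)))
    if m < 2:
        raise ValueError("tau-c needs at least two distinct values per input")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (pred[i] - pred[j]) * (ref[i] - ref[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Pairs tied in either sequence count as neither.
    return 2 * (concordant - discordant) / (n * n * (m - 1) / m)


# Placeholder reference levels and real-valued predictions.
ref = [1, 1, 2, 2, 3, 3]
pred = [0.1, 0.4, 0.3, 0.9, 1.2, 2.0]
tau = kendall_tau_c(pred, ref)

# A strictly increasing transform of the scores leaves the metric unchanged.
tau_cubed = kendall_tau_c([p ** 3 for p in pred], ref)
```

The second call illustrates why the output scale is free: any strictly increasing transformation of the predicted scores preserves every pairwise comparison and therefore the metric.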

In addition to the official ranking metric, organizers may report supplementary analyses such as per-composer performance to better understand cross-repertoire generalization and robustness across stylistic domains.

Training Datasets

Participants are welcome to train their systems on any dataset, including publicly available corpora, proprietary collections, or internally curated material. There are no restrictions on dataset origin, but we require full transparency in the technical report.

Suggested datasets for training and validation:

  • PSyllabus (Ramoneda et al., 2025): 7,901 audio recordings of classical piano pieces with 11-level pedagogical grades, together with mappings to multiple international grading systems. We specifically encourage participants to leverage the multiple ranking annotations available in PSyllabus (13 grading/ranking systems in total), not only the default labels, in order to study and improve generalization across heterogeneous pedagogical traditions and difficulty scales.
  • CIPI (Can I Play It?) (Ramoneda et al., 2024): 652 MusicXML piano pieces with 9-level Henle annotations, useful for participants interested in transcription-then-classification pipelines or distant supervision approaches.
  • Mikrokosmos-difficulty (Ramoneda et al., 2022) and other publicly available piano difficulty datasets and pedagogical collections.

Participants may use any additional external data, provided that no part of the held-out evaluation repertoire is used directly or indirectly for training. To minimize contamination risks, a list of forbidden test pieces and composers will be published together with the task call.
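
Once the forbidden list is published, training data can be screened against it before use. The sketch below is a rough illustration under stated assumptions: the list's format is not yet known, so the `composer` and `title` fields, the function names, and the matching rule (accent-insensitive, case-insensitive exact match after stripping punctuation) are all hypothetical. Exact string matching will miss alternative titles and catalogue numbers, so a manual check is still advisable.

```python
import unicodedata


def normalize(text):
    """Strip accents, case, and punctuation for a tolerant comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.casefold())
    return " ".join(cleaned.split())


def filter_training_items(items, forbidden_composers, forbidden_pieces):
    """Drop items whose composer or (composer, title) pair is forbidden.

    items: list of dicts with "composer" and "title" keys (assumed layout).
    forbidden_pieces: iterable of (composer, title) pairs (assumed layout).
    """
    bad_composers = {normalize(c) for c in forbidden_composers}
    bad_pieces = {(normalize(c), normalize(t)) for c, t in forbidden_pieces}
    kept = []
    for item in items:
        composer = normalize(item["composer"])
        title = normalize(item["title"])
        if composer in bad_composers or (composer, title) in bad_pieces:
            continue
        kept.append(item)
    return kept


# Placeholder metadata, not taken from any real forbidden list.
training_items = [
    {"composer": "Composer A", "title": "Étude No. 1"},
    {"composer": "Composer B", "title": "Prelude"},
    {"composer": "Composer C", "title": "Waltz"},
]
clean = filter_training_items(
    training_items,
    forbidden_composers=["composer c"],
    forbidden_pieces=[("COMPOSER A", "Etude no 1")],
)
```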

Please describe in your technical report:

  • Dataset names and sources
  • Size and number of pieces
  • Any preprocessing, cleaning, or label-space remapping applied

Held-Out Test Set

To address the central concern of label trust and contamination, the held-out test set is constructed as a fully private benchmark curated specifically for this task. The dataset is assembled and annotated by expert piano pedagogues, with all annotations independently validated by an expert pedagogue and musicologist. Difficulty labels are assigned using an internal ordinal scale inspired by established pedagogical curricula and examination systems, but the exact mapping, granularity, and total number of levels are intentionally not disclosed to participants.

  • The repertoire is selected from sources outside existing public piano difficulty datasets, including contemporary works, underrepresented composers, and recent pedagogical material not present in commonly used corpora. This design minimizes overlap with publicly available graded datasets and reduces the possibility of memorization or contamination effects.
  • All annotations undergo independent expert review and adjudication before inclusion in the benchmark. Neither the raw labels nor the precise scale definition is released publicly.
  • The held-out scores, audio recordings, and annotations remain private to the task organizers throughout and after the evaluation. Participants submit executable systems (Docker containers preferred), which are run by the organizers on the hidden evaluation set.

Because the underlying difficulty scale is ordinal and intentionally hidden, system performance is evaluated primarily through rank correlation rather than exact class prediction. The official ranking metric is Kendall's Tau-c, which measures agreement between predicted and reference difficulty orderings while remaining robust to unknown category spacing and differing numbers of ordinal levels.

Submission Requirements

The following items are required for submission:

  • System: packaged as a Docker container with a standardised inference interface — input is a path to a WAV file; output is a single real-valued difficulty score. A reference Docker template and inference wrapper script will be provided.
  • Technical report: 2–4 pages in ISMIR LBD format describing training data, model architecture, label-space handling, and any post-processing.
  • Compute declaration: GPU memory footprint and average inference time per piece.
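
Until the reference Docker template and wrapper script are released, the inference interface described above (a WAV path in, one real-valued score out) might look roughly like the following sketch. Everything here is an assumption: the function names, the command-line shape, and the printing of the score to standard output are illustrative, and `predict_difficulty` is only a stand-in that returns the recording duration in seconds so that the example runs without a trained model.

```python
import argparse
import wave


def predict_difficulty(wav_path):
    """Stand-in for a trained model (hypothetical).

    Returns the recording duration in seconds purely as a placeholder;
    a real system would load the audio and run its model here.
    """
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def main(argv):
    """Parse a single WAV path and print one real-valued score."""
    parser = argparse.ArgumentParser(
        description="Print a difficulty score for one WAV file."
    )
    parser.add_argument("wav_path", help="path to a solo piano WAV recording")
    args = parser.parse_args(argv)
    score = predict_difficulty(args.wav_path)
    print(f"{score:.6f}")
    return 0
```

A container entry point could then call `main(sys.argv[1:])`. The official template may define a different calling convention, in which case it takes precedence over this sketch.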

Final submissions must be made through the MIREX submission system. Submissions that exceed the 24-hour total inference budget on the full test set, or that fail on more than 5% of test items, are reported but excluded from the primary ranking.
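
The exclusion rule can be read as a simple check on two quantities. The sketch below is one interpretation, with hypothetical names: a submission stays in the primary ranking if its total wall-clock time does not exceed 24 hours and its failures do not exceed 5% of the test items. Treating exactly 5% and exactly 24 hours as acceptable follows from the wording "exceeding" and "more than", but the organizers' handling of boundary cases is not stated.

```python
def eligible_for_ranking(n_items, n_failed, total_seconds,
                         budget_hours=24, max_failure_percent=5):
    """Return True if a submission stays in the primary ranking.

    Integer arithmetic is used for the failure rate to avoid
    floating-point rounding at the exact 5% boundary.
    """
    within_budget = total_seconds <= budget_hours * 3600
    failures_ok = 100 * n_failed <= max_failure_percent * n_items
    return within_budget and failures_ok


# Placeholder run statistics for a hypothetical 200-item test set.
ok = eligible_for_ranking(n_items=200, n_failed=10, total_seconds=80_000)
too_many_failures = eligible_for_ranking(n_items=200, n_failed=11, total_seconds=80_000)
too_slow = eligible_for_ranking(n_items=200, n_failed=0, total_seconds=90_000)
```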

Timeline

(To be confirmed in line with the overall MIREX 2026 schedule.)

  • TBD: Task call published, training-data mapping released
  • TBD: Submission system opens
  • TBD: Submission deadline
  • TBD: Results announced at ISMIR 2026

Long-term Plan

We are committed to maintaining the task for at least three iterations. We will collaborate with the broader piano-difficulty research community to keep label-space conventions consistent across this task and parallel evaluation efforts. The hidden held-out test set is designed to grow incrementally across editions, with prior-year items optionally re-used subject to confirmation that participants have not gained access to them.

Given the field's relatively small size, we propose running this task annually for the first two years to bootstrap community engagement, and biennially thereafter. A symbolic-input track may be added in future editions if community interest warrants it.

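Organizers

  • Huan Zhang (Queen Mary University of London), huan.zhang@qmul.ac.uk
  • Pedro Ramoneda (Songscription), pedro@songscription.ai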

References

  • Ramoneda, P., Jeong, D., Eremenko, V., Tamer, N. C., Miron, M., & Serra, X. (2024). Combining piano performance dimensions for score difficulty classification. Expert Systems with Applications, 238, 1–16.
  • Ramoneda, P., Lee, M., Jeong, D., Valero-Mas, J. J., & Serra, X. (2025). Can audio reveal music performance difficulty? Insights from the Piano Syllabus Dataset.
  • Ramoneda, P., Tamer, N. C., Eremenko, V., Serra, X., & Miron, M. (2022). Score difficulty analysis for piano performance education based on fingering. In ICASSP 2022 – IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 201–205). IEEE.