2026:Music Evaluation via CMI-RewardBench - Revision history

Yinghao Ma: /* Overview */

2026-06-28T23:00:52Z

Overview

Yinghao Ma: /* Metrics */

2026-06-28T22:59:27Z

Metrics

Yinghao Ma at 14:58, 25 June 2026

2026-06-25T14:58:47Z

Yinghao Ma: /* Task Description */

2026-06-24T13:14:37Z

Task Description

Yinghao Ma at 22:27, 10 June 2026

2026-06-10T22:27:04Z

Yinghao Ma: Created page with "= Task Description = The MIREX 2026 Music Evaluation Task challenges participants to develop Reward Models (RMs), musicality verifier, and automatic evaluation metrics capabl..."

2026-06-09T21:12:36Z

Created page with "= Task Description = The MIREX 2026 Music Evaluation Task challenges participants to develop Reward Models (RMs), musicality verifier, and automatic evaluation metrics capabl..."

New page

= Task Description =

The MIREX 2026 Music Evaluation Task challenges participants to develop Reward Models (RMs), musicality verifiers, and automatic evaluation metrics capable of accurately predicting human preferences across diverse musical genres. Evaluating generative music remains a critical bottleneck in Music Information Retrieval (MIR) because of the discrepancy between objective acoustic metrics and subjective human preference. This task addresses the need for robust, human-aligned evaluation pipelines by introducing a dynamic framework aligned with the MIREX 2026 call for "Novel Evaluation Pipelines" and "Evaluation under limited resources".

Participants are invited to submit evaluation scripts or models that output preference scores given a prompt and a pair of generated music tracks. The benchmark evaluates systems on their ability to mirror human judgment, driving forward the development of reward modelling, which is essential both for improving music generation performance and for Reinforcement Learning from Human Feedback (RLHF) in the music domain.

= Dataset =
== Training Dataset ==

Suggested datasets include but are not limited to:
=== AIME Dataset ===
The Audio-Induced Music Evaluation (AIME) dataset serves as a foundational training resource for grounding evaluation models in music assessment tasks.

* URL: [https://huggingface.co/datasets/disco-eth/AIME](https://huggingface.co/datasets/disco-eth/AIME)

=== CMI-pseudo ===
The CMI-pseudo dataset provides large-scale, pseudo-labeled pairwise preference data. It allows models to learn initial ranking characteristics and preferences prior to fine-tuning on high-fidelity human labels.

* URL: [https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo](https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo)

== Test Dataset ==
=== CMI-Pref (Public Test Set) ===
CMI-Pref is a comprehensive public benchmark dataset for reward models in music, containing carefully curated and labeled pairwise comparisons. It serves as the open reference benchmark on which participants can validate their evaluation pipelines locally.

* URL: [https://huggingface.co/datasets/HaiwenXia/cmi-pref](https://huggingface.co/datasets/HaiwenXia/cmi-pref)

=== A Private Test Set ===
The hidden test set consists of proprietary, "in-the-wild" pairwise human preference comparisons and fresh text-to-music generations collected dynamically from a crowdsourced platform over the months leading up to the evaluation. Because this data is kept completely private and released continuously, it ensures zero data leakage and prevents benchmark saturation.

= Baselines =
== CMI-RM ==
The Cross-Modal Music Reward Model (CMI-RM) serves as the primary open-source baseline repository for this task, establishing foundational approaches for training music evaluation models on pairwise alignment.

* URL: [https://github.com/Haiwen-Xia/CMI-RewardBench](https://github.com/Haiwen-Xia/CMI-RewardBench)
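For readers new to pairwise reward modelling, the following is a minimal sketch of one common training objective, the Bradley-Terry loss. It is shown as an illustration only: this page does not state which loss CMI-RM uses, and the function name `bradley_terry_loss` is hypothetical.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred track wins.

    Under the Bradley-Terry model, P(preferred wins) = sigmoid(margin),
    where margin = score_preferred - score_rejected.
    This is an assumed objective, not confirmed as the CMI-RM loss.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the preferred track further above the rejected one, and equals log 2 when both tracks receive the same score.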

= Metrics =
Submissions will be evaluated quantitatively using standard alignment and efficiency metrics defined in the CMI-RewardBench framework ([https://arxiv.org/html/2603.00610v2](https://arxiv.org/html/2603.00610v2)). The main evaluation criteria include:

* Accuracy: How well the model's preference outputs correlate with crowdsourced human Elo ratings and pairwise preference consensus labels.

* Efficiency: To support the MIREX call for limited-resource evaluation, key resource indicators—including total model parameter size and computational latency—will be logged. Submissions that deliver high human correlation while maintaining minimal resource footprints will be explicitly highlighted.
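The two criteria above can be approximated locally with a sketch like the following. It assumes predictions and human labels are given as "A" or "B" per pair, and that latency is measured as average wall-clock time per pair; the official scoring code may differ, and both function names are hypothetical.

```python
import time
from typing import Callable, Sequence, Tuple


def pairwise_accuracy(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairs where the predicted winner ("A" or "B") matches the human label."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not human:
        raise ValueError("at least one pair is required")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


def mean_latency_seconds(score_fn: Callable[[str, str, str], float],
                         pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Average wall-clock time of score_fn over (prompt, audio_a, audio_b) triples."""
    if not pairs:
        raise ValueError("at least one pair is required")
    start = time.perf_counter()
    for prompt, audio_a, audio_b in pairs:
        score_fn(prompt, audio_a, audio_b)
    return (time.perf_counter() - start) / len(pairs)
```

Reporting both numbers side by side makes the accuracy versus resource trade-off visible for each submission.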

= Rules =

* Participants are permitted to utilize the specified training sets (AIME and CMI-pseudo) or external data assets, provided all training collections and computational dependencies are fully disclosed in their technical reports.

* Utilising data samples from the official hidden test partitions, or running optimisation routines against active live platforms such as Music Arena, is strictly prohibited.

= Submission =

* Participants must submit either a containerised system (e.g., a Docker container) containing their complete evaluation script, runtime environment, and model weights, or an API entry that exposes a standardised API. Either way, the system must accept an audio pair alongside its generation prompt and return a definitive scalar preference score.

* Technical Documentation: Submissions must include a brief technical report summarising model architecture, total training data volume, model parameter size, and computational overhead.
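The required interface (an audio pair plus its prompt in, a scalar preference score out) could look like the sketch below. Everything here is illustrative: `predict_pair`, `dummy_reward`, and the `preference_score` field are assumptions, and the sign convention (positive means A is preferred) is chosen for the example, not prescribed by the task.

```python
import json
from typing import Callable, Dict


def predict_pair(sample_id: str, prompt: str, audio_a: str, audio_b: str,
                 reward_fn: Callable[[str, str], float]) -> Dict[str, object]:
    """Score both candidates with a reward function and report the preferred one."""
    score_a = reward_fn(prompt, audio_a)
    score_b = reward_fn(prompt, audio_b)
    margin = score_a - score_b  # assumed convention: positive means A is preferred
    return {
        "sample_id": sample_id,
        "preference_score": margin,
        "preferred_candidate": "A" if margin >= 0 else "B",
    }


def dummy_reward(prompt: str, audio_path: str) -> float:
    """Placeholder reward: a real system would load the audio and run a model."""
    return float(len(audio_path))


if __name__ == "__main__":
    result = predict_pair("toy_example_001", "calm piano", "a.wav", "bb.wav", dummy_reward)
    print(json.dumps(result))
```

Scoring each candidate independently and taking the difference keeps the output antisymmetric: swapping A and B flips the sign of the score.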

= Paper =

== Submission to ISMIR Late-Breaking Demos ==
There will be no standalone MIREX track this year. Instead, task captains and participants are highly encouraged to compile their technical progress and system findings for submission to the ISMIR 2026 Late-Breaking and Demo (LBD) session. Accepted participants will have the opportunity to present their work as part of the unified MIREX task captain and participant poster sessions during ISMIR 2026.

@@ Line 10: / Line 10: @@
 * '''Input:''' A JSON or JSONL file containing data items structured similarly to the [https://huggingface.co/datasets/music-arena/music-arena-dataset Music Arena Dataset]. Each entry includes a unique identifier, a text prompt description, and paths or URLs for two audio samples (Audio A and Audio B).
-* '''Output:''' A prediction file containing a preference score indicating the relative quality and alignment, along with a declaration of the preferred candidate.
+* '''Output:''' A prediction file containing an optional preference score indicating the relative quality and alignment, along with a declaration of the preferred candidate (A or B).
 == A Simple Example ==

@@ Line 35: / Line 35: @@
+{
    "sample_id": "toy_example_001",
    "preferred_candidate": "A"
+}
@@ Line 112: / Line 111: @@
-= Submission =
+== Timeline and Submission Instructions ==
-* Participants must submit either a containerised system (e.g., Docker container) containing their complete evaluation script, runtime environments, and model weights; or an API entry which must expose a standardised API, either way shall be capable of accepting an audio pair alongside its generation prompt, returning a definitive scalar preference score.
+The evaluation follows a rapid-turnaround test phase window in October:
-* Technical Documentation: Submissions must include a brief technical report summarising model architecture, total training data volume, model parameter size, and computational overhead.
+All final prediction files must be uploaded through the official portal:
+* '''Submission Site:''' [https://futuremirex.com/submission/ https://futuremirex.com/submission/]
 = Paper =

@@ Line 97: / Line 97: @@
 Submissions will be evaluated quantitatively using standard alignment and efficiency metrics defined in the CMI-RewardBench framework ([https://arxiv.org/html/2603.00610v2](https://arxiv.org/html/2603.00610v2)). The main evaluation criteria include:
-* Accuracy: Measurement of the model's preference output correlation with crowdsourced human Elo ratings and pairwise preference consensus labels.
+* Accuracy: Measurement of the model's preference output correlation with crowdsourced human ratings (pairwise preference consensus labels).
 = Rules =

@@ Line 1: / Line 1: @@
 = Task Description =
-Participants are invited to submit evaluation scripts or models that output preference scores given a prompt and a pair of generated music tracks. The benchmark evaluates the systems on their ability to mirror human judgment, driving forward the development of reward modelling essential for improving music generation performance as well as Reinforcement Learning from Human Feedback (RLHF) in the music domain.
+The goal of this task is to develop Reward Models (RMs), musicality verifiers, and automatic evaluation metrics capable of accurately predicting human preferences across diverse musical genres. Given an input text prompt and a pair of generated audio candidates, the system must predict which audio track better aligns with human preferences and the prompt.
 = Dataset =

@@ Line 14: / Line 14: @@
 * URL: [https://huggingface.co/datasets/disco-eth/AIME](https://huggingface.co/datasets/disco-eth/AIME)
-=== CMI-pseudo ===
+=== CMI-pseudo and CMI-Pref (train split) ===
 The CMI-pseudo dataset provides large-scale, pseudo-labeled pairwise preference data. It allows models to learn initial ranking characteristics and preferences prior to fine-tuning on high-fidelity human labels.
 * URL: [https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo](https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo)
 == Test Dataset ==
-=== CMI-Pref (Public Test Set) ===
+=== CMI-Pref (test split) -- A Public Test Set ===
 CMI-Pref is a comprehensive public benchmark dataset for reward models in music, containing carefully curated and labeled pairwise comparisons. It serves as the open tracking standard for participants to validate their evaluation pipelines locally.