2026:Music Evaluation via CMI-RewardBench
Task Description
Evaluating generative music remains a critical bottleneck in Music Information Retrieval (MIR) due to the discrepancy between objective acoustic metrics and subjective human preference. This task addresses the urgent need for robust, human-aligned evaluation pipelines by introducing a dynamic framework aligned with the MIREX 2026 call for "Novel Evaluation Pipelines" and "Evaluation under limited resources".
Participants are invited to submit evaluation scripts or models that output preference scores given a prompt and a pair of generated music tracks. The benchmark evaluates the systems on their ability to mirror human judgment, driving forward the development of reward modelling essential for improving music generation performance as well as Reinforcement Learning from Human Feedback (RLHF) in the music domain.
Overview
The goal of this task is to develop Reward Models (RMs), musicality verifiers, and automatic evaluation metrics capable of accurately predicting human preferences across diverse musical genres. Given an input text prompt and a pair of generated audio candidates, the system must predict which audio track better aligns with human preferences and the prompt.
- Input: A JSON or JSONL file containing data items structured similarly to the [Arena Dataset]. Each entry includes a unique identifier, a text prompt description, and paths or URLs for two audio samples (Audio A and Audio B).
- Output: A prediction file containing a preference score indicating the relative quality and alignment, along with a declaration of the preferred candidate.
A Simple Example
Suppose the input prompt is:
ambient serene instrumental
The system is given two generated music tracks:
| Candidate | Audio |
|---|---|
| Candidate A | cmi-pref/gen-audio/5b6bc40ea307.mp3 |
| Candidate B | cmi-pref/gen-audio/2cc22000de01.mp3 |
A submitted model should compare the two audio candidates with respect to the prompt and return a prediction such as:
{
"sample_id": "toy_example_001",
"preferred_candidate": "A"
}
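The Output description asks for a preference score as well as a declared winner, while the toy prediction shows only the winner. As a minimal sketch, assuming the model produces one scalar reward per candidate, a Bradley-Terry style link can turn the two rewards into both values. The `preference_score` field name and the reward inputs are illustrative assumptions, not part of the official format.

```python
import json
import math


def pairwise_prediction(sample_id: str, reward_a: float, reward_b: float) -> dict:
    """Turn two scalar rewards into a preference score and a declared winner.

    Assumption: Bradley-Terry style link, P(A preferred) = sigmoid(r_A - r_B).
    """
    diff = reward_a - reward_b
    # Numerically stable logistic function (avoids overflow for large |diff|).
    if diff >= 0:
        p_a = 1.0 / (1.0 + math.exp(-diff))
    else:
        e = math.exp(diff)
        p_a = e / (1.0 + e)
    return {
        "sample_id": sample_id,
        "preference_score": p_a,  # hypothetical field name
        "preferred_candidate": "A" if p_a >= 0.5 else "B",
    }


print(json.dumps(pairwise_prediction("toy_example_001", 1.3, 0.4)))
```

A score near 0.5 signals a close call, which may be useful if ties or low-confidence pairs are later analysed separately.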
Submission and Environment Requirements
To ensure a standardized and reproducible evaluation under limited resources, each submitted model must be self-contained within its own environment.
Models will be executed via a standard command line interface. The framework will pass a path to the JSON/JSONL input list of audio pairs using the following syntax:
python main.py --path /path/to/input_data.jsonl
Your script should parse the data format, execute internal inference/scoring, and save or output the final preference results.
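A minimal sketch of what such an entry point could look like is given below. The input keys (`sample_id`, `prompt`, `audio_a`, `audio_b`) and the constant placeholder scorer are assumptions for illustration only; the released data format should be treated as authoritative. The loader assumes the file is either a top-level JSON array or one JSON object per line.

```python
import argparse
import json
import tempfile
from pathlib import Path


def load_items(path: Path) -> list:
    """Load a top-level JSON array or a JSONL file (one JSON object per line)."""
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def score_pair(prompt: str, audio_a: str, audio_b: str) -> float:
    """Placeholder for real inference: returns P(candidate A is preferred)."""
    return 0.5


def predict(item: dict) -> dict:
    # Hypothetical input keys; adapt them to the released data format.
    p_a = score_pair(item["prompt"], item["audio_a"], item["audio_b"])
    return {
        "sample_id": item["sample_id"],
        "preferred_candidate": "A" if p_a >= 0.5 else "B",
    }


def main(argv=None) -> list:
    parser = argparse.ArgumentParser(description="Pairwise music preference prediction")
    parser.add_argument("--path", required=True, type=Path, help="JSON/JSONL input list")
    args = parser.parse_args(argv)
    results = [predict(item) for item in load_items(args.path)]
    for record in results:
        print(json.dumps(record))  # one prediction per line on stdout
    return results


# Demo: run the entry point on a temporary one-line JSONL file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "input_data.jsonl"
    demo_item = {
        "sample_id": "toy_example_001",
        "prompt": "ambient serene instrumental",
        "audio_a": "cmi-pref/gen-audio/5b6bc40ea307.mp3",
        "audio_b": "cmi-pref/gen-audio/2cc22000de01.mp3",
    }
    demo_path.write_text(json.dumps(demo_item) + "\n", encoding="utf-8")
    demo_results = main(["--path", str(demo_path)])
```

In an actual `main.py`, the demo at the bottom would be replaced by the usual `if __name__ == "__main__": main()` guard so that the command shown above invokes it directly.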
Dataset
Training Dataset
Suggested datasets include but are not limited to:
AIME Dataset
The Audio-Induced Music Evaluation (AIME) dataset serves as a foundational training resource for grounding evaluation models in music assessment tasks.
CMI-pseudo and CMI-Pref (train split)
The CMI-pseudo dataset provides large-scale, pseudo-labeled pairwise preference data. It allows models to learn initial ranking characteristics and preferences prior to fine-tuning on high-fidelity human labels.
Music Arena
The Music Arena dataset is an open, crowdsourced live evaluation corpus compiled from blind pairwise human preference votes on text-to-music system generations. For robust benchmarking, the historical data collected from July 2025 to December 2025 is strictly sequestered as a dedicated pairwise preference test split to guarantee out-of-distribution evaluation.
SongEval
The SongEval dataset is a large-scale, open-source music aesthetic benchmark featuring over 140 hours of high-quality, full-length vocal and accompaniment tracks across 9 mainstream genres. Each track carries fine-grained human annotations from 16 professional musicians, who grade it on a 1–5 scale along five key dimensions: overall coherence, musicality, memorability, structural clarity, and vocal naturalness.
Test Dataset
CMI-Pref (test split) -- A Public Test Set
CMI-Pref is a comprehensive public benchmark dataset for reward models in music, containing carefully curated and labeled pairwise comparisons. It serves as the open tracking standard for participants to validate their evaluation pipelines locally.
A Private Test Set
The hidden test set consists of proprietary, "in-the-wild" pairwise human preference comparisons and fresh text-to-music generations collected dynamically from a crowdsourced platform over the months leading up to the evaluation. Because this data is kept completely private and released continuously, it ensures zero data leakage and prevents benchmark saturation.
Baselines
CMI-RM
The Cross-Modal Music Reward Model (CMI-RM) serves as the primary open-source baseline repository for this task, establishing foundational approaches for training music evaluation models on pairwise alignment.
Metrics
Submissions will be evaluated quantitatively using standard alignment and efficiency metrics defined in the CMI-RewardBench framework ([5](https://arxiv.org/html/2603.00610v2)). The main evaluation criteria include:
- Accuracy: How closely the model's preference outputs agree with crowdsourced human Elo ratings and pairwise preference consensus labels.
- Efficiency: To support the MIREX call for limited-resource evaluation, key resource indicators—including total model parameter size and computational latency—will be logged. Submissions that deliver high human correlation while maintaining minimal resource footprints will be explicitly highlighted.
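For local sanity checks on the public test split, the two criteria above can be approximated as follows. This is only a sketch under stated assumptions: it measures plain agreement with pairwise consensus labels and average wall-clock time per pair, whereas the official metric definitions are those of the CMI-RewardBench framework. The function names and the dictionary layout (`sample_id` mapped to `"A"` or `"B"`) are hypothetical.

```python
import time


def pairwise_accuracy(predictions: dict, labels: dict) -> float:
    """Fraction of labeled pairs whose predicted winner matches the human label.

    Pairs without a prediction count as incorrect.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(1 for sid, winner in labels.items() if predictions.get(sid) == winner)
    return correct / len(labels)


def mean_latency_seconds(score_fn, pairs: list) -> float:
    """Average wall-clock time spent per (prompt, audio_a, audio_b) pair."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    start = time.perf_counter()
    for prompt, audio_a, audio_b in pairs:
        score_fn(prompt, audio_a, audio_b)
    return (time.perf_counter() - start) / len(pairs)


# Toy data: p2 is predicted wrongly and p4 has no prediction at all.
labels = {"p1": "A", "p2": "B", "p3": "A", "p4": "B"}
predictions = {"p1": "A", "p2": "A", "p3": "A"}
print(pairwise_accuracy(predictions, labels))  # 0.5

pairs = [("ambient serene instrumental", "a.mp3", "b.mp3")] * 3
print(mean_latency_seconds(lambda prompt, a, b: 0.5, pairs))
```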
Rules
- Participants are permitted to utilize the specified training sets (AIME and CMI-pseudo) or external data assets, provided all training collections and computational dependencies are fully disclosed in their technical reports.
- Utilising data samples from the official hidden test partitions or deploying optimisation routines against active live platforms such as Music Arena is strictly prohibited.
Timeline and Submission Instructions
The evaluation follows a rapid-turnaround test phase window in October:
- October 1st: The evaluation team will release the official music audio files and the corresponding evaluation JSONL file.
- October 2nd (AOE): Deadline to generate and upload your predictions. You must submit the completed JSONL file containing the preference predictions for each of the given audio pairs.
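Given the one-day turnaround, it may be worth checking before upload that the prediction file covers every released pair exactly once. The sketch below assumes both files have already been parsed into lists of dictionaries and that each record carries a `sample_id` key; that key name for the input file and the `check_submission` helper are illustrative assumptions.

```python
def check_submission(input_items: list, predictions: list) -> dict:
    """Report coverage problems between the released pairs and the predictions."""
    expected = {item["sample_id"] for item in input_items}
    seen, duplicates, invalid = set(), set(), set()
    malformed = 0
    for record in predictions:
        sid = record.get("sample_id")
        if not isinstance(sid, str):
            malformed += 1  # record without a usable identifier
            continue
        if sid in seen:
            duplicates.add(sid)
        seen.add(sid)
        if record.get("preferred_candidate") not in {"A", "B"}:
            invalid.add(sid)
    return {
        "missing": sorted(expected - seen),
        "unexpected": sorted(seen - expected),
        "duplicates": sorted(duplicates),
        "invalid": sorted(invalid),
        "malformed": malformed,
    }


released = [{"sample_id": "p1"}, {"sample_id": "p2"}, {"sample_id": "p3"}]
submitted = [
    {"sample_id": "p1", "preferred_candidate": "A"},
    {"sample_id": "p2", "preferred_candidate": "C"},  # invalid label
    {"sample_id": "p2", "preferred_candidate": "B"},  # duplicate id
    {"sample_id": "p9", "preferred_candidate": "A"},  # not in the released list
]
report = check_submission(released, submitted)
print(report)
```

A file is ready to upload under these assumptions when every list in the report is empty and `malformed` is zero.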
All final prediction files must be uploaded through the official portal:
- Submission Site: https://futuremirex.com/submission/
Paper
Submission to ISMIR Late-Breaking Demos
There will be no standalone MIREX track this year. Instead, task captains and participants are highly encouraged to compile their technical progress and system findings for submission to the ISMIR 2026 Late-Breaking and Demo (LBD) session. Accepted participants will have the opportunity to present their work as part of the unified MIREX task captain and participant poster sessions during ISMIR 2026.