2026:Music Evaluation via CMI-RewardBench
Task Description
The MIREX 2026 Music Evaluation Task challenges participants to develop Reward Models (RMs), musicality verifiers, and automatic evaluation metrics capable of accurately predicting human preferences across diverse musical genres. Evaluating generative music remains a critical bottleneck in Music Information Retrieval (MIR) due to the discrepancy between objective acoustic metrics and subjective human preference. This task addresses the need for robust, human-aligned evaluation pipelines by introducing a dynamic framework aligned with the MIREX 2026 call for "Novel Evaluation Pipelines" and "Evaluation under limited resources".
Participants are invited to submit evaluation scripts or models that output preference scores given a prompt and a pair of generated music tracks. The benchmark evaluates systems on their ability to mirror human judgment, advancing the reward modelling needed both to improve music generation and to apply Reinforcement Learning from Human Feedback (RLHF) in the music domain.
Dataset
Training Dataset
Suggested datasets include but are not limited to:
AIME Dataset
The Audio-Induced Music Evaluation (AIME) dataset serves as a foundational training resource for grounding evaluation models in music assessment tasks.
CMI-pseudo and CMI-Pref (train split)
The CMI-pseudo dataset provides large-scale, pseudo-labeled pairwise preference data. It allows models to learn initial ranking characteristics and preferences prior to fine-tuning on high-fidelity human labels.
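Learning from pairwise preference labels is commonly framed with a Bradley-Terry style objective, in which the model assigns each track a scalar score and is penalised when the preferred track does not score higher. The sketch below illustrates that general idea only; it is not taken from the CMI-RM repository, and the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred track wins under Bradley-Terry.

    Equals -log(sigmoid(score_preferred - score_rejected)).
    """
    margin = score_preferred - score_rejected
    # log(1 + exp(-margin)), split by sign so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average loss over (preferred_score, rejected_score) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(bradley_terry_loss(p, r) for p, r in pairs) / len(pairs)
```

Under this framing, a model could first be fitted on the pseudo-labeled pairs and then fine-tuned on human-labeled pairs with the same loss.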
Music Arena
The Music Arena dataset is an open, crowdsourced live evaluation corpus compiled from blind pairwise human preference votes on text-to-music system generations. For robust benchmarking, the historical data collected from July 2025 to December 2025 is strictly sequestered as a dedicated pairwise preference test split to guarantee out-of-distribution evaluation.
SongEval
The SongEval dataset is a large-scale, open-source music aesthetic benchmark featuring over 140 hours of high-quality, full-length vocal and accompaniment tracks across 9 mainstream genres. Each track carries fine-grained human annotations from 16 professional musicians, who grade it on a 1–5 scale along five key dimensions: overall coherence, musicality, memorability, structural clarity, and vocal naturalness.
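Because this task is pairwise while SongEval ratings are per track, one possible way to use them for training is to average the ratings and compare two tracks. The sketch below assumes a hypothetical record layout (one dict per annotator, keyed by dimension) and an arbitrary `min_gap` threshold; neither comes from the SongEval release.

```python
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical keys mirroring the five dimensions named above.
DIMENSIONS = (
    "overall_coherence",
    "musicality",
    "memorability",
    "structural_clarity",
    "vocal_naturalness",
)


def track_score(ratings: List[Dict[str, float]]) -> float:
    """Average each dimension over annotators, then average the dimensions."""
    per_dimension = {d: mean(r[d] for r in ratings) for d in DIMENSIONS}
    return mean(per_dimension.values())


def to_preference(score_a: float, score_b: float, min_gap: float = 0.5) -> Optional[int]:
    """Return 1 if A is preferred, 0 if B is preferred, None if the gap is too small."""
    if abs(score_a - score_b) < min_gap:
        return None  # treat near-ties as unlabeled rather than guessing
    return 1 if score_a > score_b else 0
```

Dropping near-ties is a design choice meant to avoid training on pairs where annotators would likely disagree; equal weighting of the five dimensions is likewise an assumption.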
Test Dataset
CMI-Pref (test split) -- A Public Test Set
CMI-Pref is a comprehensive public benchmark dataset for reward models in music, containing carefully curated and labeled pairwise comparisons. It serves as the open tracking standard for participants to validate their evaluation pipelines locally.
A Private Test Set
The hidden test set consists of proprietary, "in-the-wild" pairwise human preference comparisons and fresh text-to-music generations collected dynamically from a crowdsourced platform over the months leading up to the evaluation. Because this data is kept completely private and released continuously, it ensures zero data leakage and prevents benchmark saturation.
Baselines
CMI-RM
The Cross-Modal Music Reward Model (CMI-RM) serves as the primary open-source baseline repository for this task, establishing foundational approaches for training music evaluation models on pairwise alignment.
Metrics
Submissions will be evaluated quantitatively using standard alignment and efficiency metrics defined in the CMI-RewardBench framework ([5](https://arxiv.org/html/2603.00610v2)). The main evaluation criteria include:
- Accuracy: How closely the model's preference outputs match human judgment, measured as correlation with crowdsourced human Elo ratings and agreement with pairwise preference consensus labels.
- Efficiency: To support the MIREX call for limited-resource evaluation, key resource indicators—including total model parameter size and computational latency—will be logged. Submissions that deliver high human correlation while maintaining minimal resource footprints will be explicitly highlighted.
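As a rough illustration of the two criteria, the sketch below computes agreement with pairwise consensus labels and a mean per-pair latency. It does not reproduce the official CMI-RewardBench scoring code; the sign convention (a positive score means the first track is preferred), the label encoding, and the function names are assumptions, and the Elo correlation is not covered.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def pairwise_accuracy(scores: Sequence[float], human_labels: Sequence[int]) -> float:
    """Fraction of pairs where the system's choice matches the human label.

    scores[i] > 0 means the system prefers track A (assumed convention);
    human_labels[i] is 1 if humans preferred A, otherwise 0.
    A score of exactly 0 is counted as preferring B.
    """
    if len(scores) != len(human_labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = sum((s > 0) == bool(y) for s, y in zip(scores, human_labels))
    return correct / len(scores)


def mean_latency_seconds(score_fn: Callable[[T], float], inputs: Sequence[T]) -> float:
    """Average wall-clock time of one score_fn call over the given inputs."""
    if not inputs:
        raise ValueError("at least one input is required")
    start = time.perf_counter()
    for item in inputs:
        score_fn(item)
    return (time.perf_counter() - start) / len(inputs)
```

For example, `pairwise_accuracy([0.5, -1.0, 2.0, -0.1], [1, 0, 0, 0])` returns `0.75`, since the system disagrees with the human label only on the third pair.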
Rules
- Participants are permitted to utilize the specified training sets (AIME and CMI-pseudo) or external data assets, provided all training collections and computational dependencies are fully disclosed in their technical reports.
- Utilising data samples from the official hidden test partitions, or running optimisation routines against live platforms such as Music Arena, is strictly prohibited.
Submission
- Participants must submit either a containerised system (e.g., a Docker container) containing their complete evaluation script, runtime environment, and model weights, or an entry that exposes a standardised API. Either way, the system must accept an audio pair alongside its generation prompt and return a scalar preference score.
- Technical Documentation: Submissions must include a brief technical report summarising model architecture, total training data volume, model parameter size, and computational overhead.
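The required interface (a prompt plus an audio pair in, a scalar preference score out) might be sketched as follows. The organisers' actual API schema is not specified here, so the `PairRequest` fields, the `make_pair_scorer` helper, and the sign convention are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PairRequest:
    """One evaluation item: a generation prompt and two candidate tracks."""
    prompt: str
    audio_a_path: str
    audio_b_path: str


def make_pair_scorer(
    track_scorer: Callable[[str, str], float],
) -> Callable[[PairRequest], float]:
    """Wrap a per-track scorer (prompt, audio_path) -> float into a pairwise scorer.

    The returned function yields a single scalar: positive when track A is
    preferred, negative when track B is preferred (assumed convention).
    """
    def score_pair(request: PairRequest) -> float:
        score_a = track_scorer(request.prompt, request.audio_a_path)
        score_b = track_scorer(request.prompt, request.audio_b_path)
        return score_a - score_b

    return score_pair
```

Taking the difference of two per-track scores is only one option; a model that looks at both tracks jointly could implement the same `PairRequest -> float` signature directly.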
Paper
Submission to ISMIR Late-Breaking Demos
There will be no standalone MIREX track this year. Instead, task captains and participants are highly encouraged to write up their technical progress and system findings for submission to the ISMIR 2026 Late-Breaking and Demo (LBD) session. Accepted participants will have the opportunity to present their work as part of the unified MIREX task captain and participant poster sessions during ISMIR 2026.