2025:Music Reasoning QA
Task Description
The MIREX 2025 Music Reasoning Question Answering (QA) Task challenges participants to develop models capable of answering natural language questions that require understanding and reasoning over musical audio. The task seeks to advance the frontier of machine music intelligence by evaluating models on their ability to reason about many kinds of music information, such as musical structure, instrument presence, melodic content, vocal content, and environmental context, together with knowledge of music theory and music history.
Participants will build systems that answer multiple-choice questions grounded in audio inputs. The task includes questions from four curated subsets (Music, Music-Speech, Music-Sound, Music-Speech-Sound) of the MMAR benchmark, plus the Music subset of the OmniBench benchmark, for which image captions are provided. Each question is paired with an audio clip and 2–4 answer choices.
Systems will be evaluated on multiple-choice accuracy. The task encourages research on multimodal understanding, few-shot learning, and real-world music audio analysis.
Dataset
Test Dataset: MMAR benchmark
MMAR is a comprehensive benchmark specifically designed to evaluate deep reasoning capabilities in Audio-Language Models (ALMs). Unlike prior datasets that focus narrowly on individual domains such as speech, music, or general sound events, MMAR introduces a broad, realistic, and interdisciplinary collection of tasks spanning diverse audio types and complex multimodal interactions. It comprises 1,000 carefully curated question-answer pairs, each associated with a short audio clip averaging 20 seconds in length. These audio samples are sourced from real-world internet videos and include a rich variety of natural auditory scenes. The dataset covers seven audio modality categories, reflecting both pure and mixed content: Sound, Music, Speech, Music-Speech, Music-Sound, Speech-Sound, Music-Speech-Sound.
Each QA instance in MMAR is crafted to test multi-step reasoning beyond surface-level recognition. The questions are organized into a hierarchical reasoning taxonomy with four levels:
- Signal-level (e.g., pitch, rhythm, volume)
- Perception-level (e.g., emotion, instrument timbre)
- Semantic-level (e.g., inferred activity or scene)
- Cultural-level (e.g., genre, style, or symbolic references)
In addition to the question and correct answer, each MMAR item provides 2–4 multiple-choice answer candidates together with its audio domain, task type, and reasoning category.
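To make the data layout concrete, a single MMAR-style item can be thought of as the following record. This is an illustrative sketch; the field names and values are assumptions for exposition, not the exact schema released in the MMAR repository.

# Illustrative MMAR-style QA record (field names are assumptions, not the
# official schema); each item couples one audio clip with one question.
example_item = {
    "audio_path": "clips/000123.wav",                    # ~20 s clip from a real-world video
    "question": "Which instrument carries the main melody after the spoken intro?",
    "choices": ["Violin", "Electric guitar", "Flute"],   # 2-4 candidates
    "answer": "Electric guitar",                         # exactly one is correct
    "modality": "Music-Speech",                          # one of the seven audio categories
    "reasoning_level": "Perception-level",               # taxonomy level from the list above
}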
MMAR's development process involved expert annotators, LLM-assisted taxonomy design, and multiple stages of quality assurance to ensure annotation accuracy, question clarity, and reasoning depth. Tasks often require integrating perceptual judgment, symbolic interpretation, and contextual knowledge, some at graduate-level difficulty.
For the MIREX 2025 Music Reasoning QA Task, we focus on a subset of MMAR that includes all questions in the Music-Speech, Music-Sound, and Music-Speech-Sound categories. These subsets test a model's ability to reason about musical audio in context, e.g., understanding musical scenes that include speech overlays, environmental sound, or expressive musical gestures embedded in noisy backgrounds.
- Ma, Ziyang, et al. "MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix." arXiv preprint arXiv:2505.13032 (2025).
Test Dataset: OmniBench benchmark
OmniBench is a tri-modal benchmark designed to evaluate how well multimodal large language models (MLLMs) reason over image, audio, and text inputs. It contains 1,142 multiple-choice QA tasks, each paired with a high-resolution image, a short audio clip (up to 30 seconds), and a natural language question with 4 answer choices.
The audio content spans speech, sound events, and music, making OmniBench suitable for testing real-world, multimodal reasoning. Questions are categorized into tasks including Action & Activity, Story Description, Plot Inference, Object Identification & Description, Contextual & Environmental, Identity & Relationship, Text & Symbols, Count & Quantity. Each question is designed to require information from at least two modalities, with verified rationales ensuring high quality.
For the MIREX 2025 Music Reasoning QA Task, we use the music-only subset of OmniBench. In this subset, each question is paired with:
- A music audio clip
- A visual image related to the musical context
- A multiple-choice question that requires interpreting both modalities to answer correctly
This setting is designed to test a model's ability to perform music-centered reasoning in multimodal environments—for example, identifying instruments from both sound and image, inferring performance context, or detecting genre-related cues that require combined auditory and visual understanding. Models are expected to process both the music audio and the accompanying image caption to generate the correct response.
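Because most audio-language baselines accept only audio plus text, one simple strategy is to fold the provided image caption into the text prompt. The sketch below assumes that strategy and a generic letter-based answer format; neither is prescribed by OmniBench or the task organizers.

# Hedged sketch: merge the OmniBench image caption into the text prompt so an
# audio-plus-text model can still use the visual context. The prompt layout
# is an assumption, not an official format.
def build_prompt(question, choices, image_caption):
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        f"Image description: {image_caption}\n"
        f"Question: {question}\n"
        f"Options:\n{options}\n"
        "Reply with the letter of the single best option."
    )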
- Li, Yizhi, et al. "Omnibench: Towards the future of universal omni-language models." arXiv preprint arXiv:2409.15272 (2024).
Training Dataset: CoTA dataset
The Chain-of-Thought Audio (CoTA) dataset is a large-scale resource designed to train and evaluate models on complex, structured audio reasoning tasks. It contains 1.2 million QA pairs across three audio domains (music, speech, and environmental sound) and is specifically built to support multi-step inference with audio as the primary input.
Each sample in CoTA includes:
- A short audio clip (≤ 30 seconds)
- A question targeting reasoning over acoustic content
- A correct answer (multiple choice)
- A detailed Chain-of-Thought (CoT) explanation showing intermediate reasoning steps
CoTA questions span a wide range of reasoning skills: from basic acoustic analysis and speech transcription to music emotion understanding, genre inference, and multi-speaker context modeling. It integrates data from real-world datasets (e.g., AudioSet, MusicBench, CoVoST2) and synthetic sources to ensure diversity and scalability.
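One plausible way to exploit the CoT explanations during training is to have the model emit its reasoning before the final choice. The formatting below is a hypothetical sketch: the field names and the <think> tag are assumptions for illustration, not the released CoTA schema or Audio-Reasoner's exact format.

# Hypothetical supervised fine-tuning format for a CoTA sample: the target
# contains the chain-of-thought followed by the final answer. Field names and
# the <think> tag are illustrative assumptions.
def format_cota_sample(sample):
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(sample["choices"]))
    prompt = f"Question: {sample['question']}\nOptions:\n{options}"
    target = f"<think>{sample['cot']}</think>\nAnswer: {sample['answer']}"
    return prompt, target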
- Xie, Zhifei, et al. "Audio-reasoner: Improving reasoning capability in large audio language models." arXiv preprint arXiv:2503.02318 (2025).
Baselines
Traditional Music Captioning and QA Models: MusiLingo
- Deng, Zihao, et al. "MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response." Findings of the Association for Computational Linguistics: NAACL 2024. 2024.
Large Audio Language Models: Qwen2-Audio
- Chu, Yunfei, et al. "Qwen2-Audio technical report." arXiv preprint arXiv:2407.10759 (2024).
Large Audio Reasoning Models: Audio-Reasoner
- Xie, Zhifei, et al. "Audio-reasoner: Improving reasoning capability in large audio language models." arXiv preprint arXiv:2503.02318 (2025).
Large Omni Language Models: Qwen2.5-Omni
- Xu, Jin, et al. "Qwen2.5-Omni technical report." arXiv preprint arXiv:2503.20215 (2025).
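As a starting point, the audio-based baselines can be queried directly with the question and answer choices. The sketch below uses Qwen2-Audio through Hugging Face transformers; keyword arguments and chat-template behaviour can differ across transformers versions, so treat it as an assumed outline rather than a verified script.

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Assumed baseline outline for Qwen2-Audio; exact processor/generate arguments
# may differ depending on the installed transformers version.
model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

def answer_question(audio_path, prompt):
    conversation = [{"role": "user", "content": [
        {"type": "audio", "audio_url": audio_path},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    wav, _ = librosa.load(audio_path, sr=processor.feature_extractor.sampling_rate)
    inputs = processor(text=text, audios=[wav], return_tensors="pt", padding=True).to(model.device)
    generated = model.generate(**inputs, max_new_tokens=64)
    generated = generated[:, inputs.input_ids.shape[1]:]   # keep only the newly generated tokens
    return processor.batch_decode(generated, skip_special_tokens=True)[0]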
Metrics
Submissions will be evaluated using accuracy, defined as the proportion of correctly answered questions out of the total number of questions. Each question in the benchmark is a multiple-choice question with 2 to 4 candidate answers, and exactly one correct answer. For each test sample, the model must select the single most likely answer choice. No partial credit is awarded. In cases where the model fails to return a valid choice, the prediction is counted as incorrect.
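For reference, the metric reduces to plain accuracy with missing or invalid predictions counted as wrong; a minimal sketch, assuming predictions and references are dictionaries keyed by question ID, is:

# Minimal accuracy sketch: missing or invalid predictions count as incorrect.
# Keying by question ID is an assumption about the submission format.
def accuracy(predictions: dict, references: dict) -> float:
    correct = sum(1 for qid, gold in references.items() if predictions.get(qid) == gold)
    return correct / len(references)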
Optional qualitative analysis may be conducted on a subset of model responses to assess reasoning plausibility, but only accuracy will be used for official ranking.
Rules
- Participants are allowed to utilise external datasets and pre-trained models in developing their systems, provided this is stated in their technical report. However, the use of data samples from the MMAR benchmark or the OmniBench benchmark for training or validation is strictly prohibited.
- All submissions must respect the CC-BY(-NC) licenses under which MMAR and OmniBench are released.
Submission
- Submissions will be evaluated using CodaBench (TBD) for automated assessment.
- Submission Deadline: September 1, AoE (Anywhere on Earth)
- Participants are required to submit the following:
- JSON file
- A JSON file containing the selected answer choice for each question in the evaluation dataset. The format should match the structure provided in the MMAR and OmniBench repositories.
- PDF file
- A PDF file detailing the system architecture, training process, and any external data or models used. This should include a clear statement that the MMAR and OmniBench data were not used for training or validation.
- Example of JSON
- TBA
Paper
Submission to LLM4Music Satellite Event & LBD
Participants are encouraged to submit their technical reports to ISMIR, including but not limited to the LBD track or the LLM4Music Satellite Event at ISMIR 2025. You can receive reviewer feedback that will benefit a subsequent submission to the main conference or a journal next year.
Workshop presentation
We will invite top-ranked participants to present their work during the MIREX workshop session. The format will be hybrid to accommodate remote participation.