2026:AI-Generated Music Detection
Task Description
The MIREX 2026 AI-Generated Music Detection Task invites participants to develop systems that can detect whether a music recording is fully AI-generated or human-made.
In the first iteration of this task, we focus on a simple and accessible binary setting:
- Positive: the recording is fully AI-generated music.
- Negative: the recording is real human-made music.
Participants are asked to submit systems that take a music audio recording as input and output a probability score between 0 and 1. A higher score indicates that the system believes the recording is more likely to be AI-generated.
This first-year task is intentionally limited to full-song AI-generated music detection. More complex cases, such as AI-generated stems, AI-assisted remixing, localized AI insertions, neural codec reconstruction, or partially AI-involved music, may be considered in future iterations.
Dataset
Hidden Test Dataset
The official evaluation set will be hidden. Participants will not receive the raw evaluation audio or item-level labels.
The hidden test set will contain both fully AI-generated music and real human-made music. The AI-generated portion will be constructed from multiple music generation systems, where licensing and evaluation conditions permit.
Candidate AI-generated sources may include:
- Suno
- Udio
- Mureka
- MiniMax
- YuE
- ACE-Step
The real-music negative examples will be selected from CC0 or otherwise evaluation-compatible human-made music sources.
The hidden test set will be designed to evaluate whether systems can generalize across multiple AI music generators rather than overfitting to a single source. The exact composition of the hidden test set will not be disclosed before evaluation.
Training Dataset
We plan to provide a training dataset to make the task easier to enter, especially for participants who do not have access to large-scale AI-generated music data.
Possible training sources include:
- SONICS: approximately 96k tracks generated by Suno v3.5 and Udio.
- Muse: approximately 116k tracks generated by Suno v5.
- Other AI-generated or human-made music sources, to be determined.
The final training dataset will be announced before the submission phase.
Participants may also use their own public, private, synthetic, or self-constructed training data, provided that no part of the hidden evaluation set is used directly or indirectly for training, validation, model selection, prompt tuning, or threshold tuning.
Input and Output Format
Input
Submitted systems will receive:
- A directory of audio files
- A metadata CSV file
Audio files will be provided as WAV files. They may be:
- 44.1 kHz or 48 kHz
- Mono or stereo
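As a minimal sketch of how a system might check incoming files against the formats listed above, the following reads only the WAV header using Python's standard `wave` module. The helper names `describe_wav` and `is_expected_format` are hypothetical, and the sketch assumes PCM-encoded WAV files, since the bit depth and encoding are not specified in this description.

```python
import wave

ALLOWED_RATES = {44100, 48000}  # sample rates named in the task description
ALLOWED_CHANNELS = {1, 2}       # mono or stereo


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) read from a WAV header."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return rate, channels, frames / rate


def is_expected_format(path):
    """Check a file against the announced sample rates and channel layouts."""
    rate, channels, _ = describe_wav(path)
    return rate in ALLOWED_RATES and channels in ALLOWED_CHANNELS
```

A system that resamples or downmixes internally could use such a check to decide which preprocessing path to take for each file.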
Output
Each system must output one AI-generated music score for each audio file.
The score must be a scalar value in the range [0, 1], where:
- 0 means the system believes the recording is very unlikely to be AI-generated.
- 1 means the system believes the recording is very likely to be AI-generated.
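One way to obtain a score in the required range, assuming the model ends in a single unbounded logit (an assumption about the participant's system, not a task requirement), is a numerically stable sigmoid. The function name `logit_to_score` is hypothetical.

```python
import math


def logit_to_score(logit):
    """Map a raw classifier logit to a score in [0, 1] with a stable sigmoid."""
    if logit >= 0:
        # exp(-logit) cannot overflow when logit is non-negative
        return 1.0 / (1.0 + math.exp(-logit))
    # exp(logit) cannot overflow when logit is negative
    z = math.exp(logit)
    return z / (1.0 + z)
```

Branching on the sign keeps the exponent non-positive, so extreme logits saturate at 0 or 1 instead of raising an overflow error.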
Baselines
We plan to provide a baseline model and checkpoint to help participants get started.
The baseline system may include:
- A standard audio classifier trained on the provided training dataset
- A music or audio foundation model with a binary classification head
- A reproducible inference pipeline
- A released checkpoint
- Example scripts for running inference and producing the required submission file
The baseline is intended as a starting point rather than a competitive upper bound. Participants are encouraged to improve upon it using better architectures, training strategies, data construction, calibration, and robustness methods.
Metrics
The official evaluation is binary. Each system outputs a continuous AI-generated music score for each test item.
Primary Metric
- Macro-averaged AUROC across hidden evaluation strata
  - This is the primary ranking metric. AUROC is threshold-free, which matters because different real-world applications may require different operating thresholds. Macro-averaging, which gives each stratum equal weight, prevents the final ranking from being dominated by easy subsets or by one particular generator family.
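The macro-averaging described above can be sketched in plain Python. This is an illustration, not the official scoring code: the definition of the strata is hidden, so the sketch assumes each item carries one stratum label and that every stratum contains both AI-generated and human-made items. The function names `auroc` and `macro_auroc` are hypothetical.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation, averaging tied ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def macro_auroc(labels, scores, strata):
    """Unweighted mean of per-stratum AUROC; also returns the per-stratum values."""
    groups = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, strata):
        groups[g][0].append(y)
        groups[g][1].append(s)
    per_stratum = {g: auroc(ys, ss) for g, (ys, ss) in groups.items()}
    return sum(per_stratum.values()) / len(per_stratum), per_stratum


# Toy example: perfect ranking in stratum "a", inverted ranking in stratum "b".
macro, detail = macro_auroc(
    labels=[0, 1, 0, 1],
    scores=[0.2, 0.9, 0.6, 0.4],
    strata=["a", "a", "b", "b"],
)
print(macro, detail)
```

Because each stratum contributes equally, a system that is perfect on one stratum but inverted on another scores 0.5 here regardless of how many items each stratum holds.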
Secondary Metrics
Secondary metrics will be reported for diagnostic analysis. These may include:
- Pooled AUROC
  - Measures overall ranking performance across all test items.
- AUPRC
  - Measures precision-recall performance, especially under class imbalance.
- Equal Error Rate
  - Reports the error rate at the operating point where the false positive rate and false negative rate are equal.
- Balanced Accuracy
  - Reports the mean of the true positive rate and true negative rate at a selected threshold, so that neither class dominates the result.
- F1 Score
  - Measures the harmonic mean of precision and recall at a selected threshold.
- False Positive Rate on Real Human Music
  - Measures how often human-made music is incorrectly classified as AI-generated.
- False Negative Rate by Generator Source
  - Measures how often AI-generated music is missed, reported separately for each generator family.
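Several of the threshold-based metrics above can be illustrated with a short sketch. It assumes a track is predicted AI-generated when its score is at or above the threshold, that both classes are present, and that the equal error rate is approximated at the observed threshold where the two error rates are closest (without interpolation). The official threshold selection is not specified here, and the function names are hypothetical.

```python
def threshold_metrics(labels, scores, threshold=0.5):
    """FPR, FNR, balanced accuracy, and F1 at one threshold (1 = AI-generated)."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_ai = s >= threshold
        if y == 1 and predicted_ai:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_ai:
            fp += 1
        else:
            tn += 1
    fpr = fp / (fp + tn)  # human-made music flagged as AI-generated
    fnr = fn / (fn + tp)  # AI-generated music that was missed
    recall = 1.0 - fnr
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "fpr": fpr,
        "fnr": fnr,
        "balanced_accuracy": ((1.0 - fpr) + recall) / 2.0,
        "f1": f1,
    }


def equal_error_rate(labels, scores):
    """Approximate EER: mean of FPR and FNR where their gap is smallest."""
    best_gap, best_eer = None, None
    for t in sorted(set(scores)) + [float("inf")]:
        m = threshold_metrics(labels, scores, t)
        gap = abs(m["fpr"] - m["fnr"])
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (m["fpr"] + m["fnr"]) / 2.0
    return best_eer
```

Restricting the inputs to the AI-generated items from one generator plus the shared human-made items would give a per-generator false negative rate in the same way.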
Additional diagnostic results may include performance by vocal/instrumental category, generator-held-out condition, compression condition, excerpt length, and difficulty level.
Internal metadata will be used only for aggregate diagnostic reporting and will not be released at the item level.
Download
The hidden evaluation set will not be publicly released.
The held-out audio will remain private to the organizers throughout and after the evaluation. All evaluation items will be created, licensed, commissioned, or selected under conditions that permit private evaluation by the organizers.
The training dataset, baseline model, checkpoint, and example submission scripts will be released before the submission phase, subject to licensing and infrastructure constraints.
Rules
- Participants may use the provided training dataset and baseline model.
- Participants may use external datasets and pre-trained models.
- Participants may use public, private, synthetic, or self-constructed data for training.
- Participants must not use any part of the hidden evaluation set for training, validation, model selection, prompt tuning, or threshold tuning.
- Participants must describe all training data, pre-trained models, external APIs, watermark detectors, and major preprocessing steps in the technical report.
- External API calls are discouraged and may be prohibited depending on MIREX execution policy, privacy requirements, and reproducibility constraints.
- The full hidden test set must be processed within a 24-hour wall-clock budget on a single GPU.
- Submissions that exceed the time budget, fail on more than 5% of test items, or output invalid scores for more than 5% of test items will be reported but excluded from the primary ranking.
- Participants must respect all relevant licenses for the data and models used in their systems.
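The time-budget and 5% rules above can be expressed as a simple eligibility check. This is a sketch under assumptions: it treats a missing row as a failed item and a non-numeric, non-finite, or out-of-range value as an invalid score, which may differ from how the organizers count them. The names `is_valid_score` and `eligible_for_ranking` are hypothetical.

```python
import math

TIME_BUDGET_SECONDS = 24 * 60 * 60  # 24-hour wall-clock budget
MAX_FAILURE_RATE = 0.05             # "more than 5%" of items failed
MAX_INVALID_RATE = 0.05             # "more than 5%" of scores invalid


def is_valid_score(value):
    """A valid score parses as a finite float in [0, 1]."""
    try:
        x = float(value)
    except (TypeError, ValueError):
        return False
    return math.isfinite(x) and 0.0 <= x <= 1.0


def eligible_for_ranking(expected_ids, outputs, elapsed_seconds):
    """outputs maps track_id -> raw score; IDs absent from it count as failures."""
    n = len(expected_ids)
    failed = sum(1 for t in expected_ids if t not in outputs)
    invalid = sum(
        1 for t in expected_ids if t in outputs and not is_valid_score(outputs[t])
    )
    return (
        elapsed_seconds <= TIME_BUDGET_SECONDS
        and failed / n <= MAX_FAILURE_RATE
        and invalid / n <= MAX_INVALID_RATE
    )
```

Running a check like this on a local validation set before submission can catch crashes and NaN outputs that would otherwise remove a system from the primary ranking.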
Submission
Participants are required to submit the following:
- Docker container
  - A Docker container with a standardized inference interface. The system should take a directory of WAV files and a metadata CSV file as input, and produce a CSV file containing one AI-generated music score per item.
- Technical report
  - A 2-4 page technical report in ISMIR LBD format. The report should describe the system architecture, training data, preprocessing, inference-time input duration, use of external APIs if any, use of watermark detection if any, thresholding strategy, known limitations, and compute requirements.
- Compute declaration
  - A compute declaration reporting training data size, model size, GPU memory footprint, average inference time per track, total expected runtime, and other computational resources used in model development.
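A container entry point for the inference interface described above might look like the following sketch. The standardized interface has not been published, so the flag names (`--audio-dir`, `--metadata`, `--output`), the use of the file stem as `track_id`, the placeholder `score_track`, and the fallback score of 0.5 on errors are all assumptions made for illustration.

```python
import argparse
import csv
from pathlib import Path


def score_track(wav_path):
    """Placeholder scorer (hypothetical): a real system would run its model here."""
    return 0.5


def run_inference(audio_dir, output_csv):
    """Score every WAV file in audio_dir and write one CSV row per file."""
    wav_paths = sorted(Path(audio_dir).glob("*.wav"))
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["track_id", "ai_generated_score"])
        for path in wav_paths:
            try:
                score = score_track(path)
            except Exception:
                # Assumed fallback so one unreadable file does not abort the run.
                score = 0.5
            writer.writerow([path.stem, f"{score:.3f}"])
    return len(wav_paths)


def main(argv=None):
    parser = argparse.ArgumentParser(description="AI-generated music detection inference")
    parser.add_argument("--audio-dir", required=True, help="directory of WAV files")
    parser.add_argument("--metadata", required=True, help="metadata CSV (format to be announced)")
    parser.add_argument("--output", required=True, help="path of the score CSV to write")
    args = parser.parse_args(argv)
    # The metadata file is accepted but unused here because its format is not yet known.
    count = run_inference(args.audio_dir, args.output)
    print(f"scored {count} files")
```

Calling `main()` with no arguments parses the command line, so a container could invoke the script with the three paths mounted by the evaluation harness.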
Output CSV Format
The output CSV should contain one row per audio file.
Example:
track_id,ai_generated_score
000001,0.972
000002,0.084
000003,0.611
000004,0.238
The exact metadata format and required file naming convention will be announced before the submission phase.
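Until the exact format is announced, a file shaped like the example above can be read back and sanity-checked with a sketch such as the following. It assumes the two column names shown in the example and keeps `track_id` as a string so that leading zeros survive. The function name `load_scores` is hypothetical.

```python
import csv

EXPECTED_HEADER = ["track_id", "ai_generated_score"]  # taken from the example above


def load_scores(csv_path):
    """Read an output CSV into {track_id: score}, rejecting malformed files."""
    scores = {}
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != EXPECTED_HEADER:
            raise ValueError(f"unexpected header: {header}")
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            if len(row) != 2:
                raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
            track_id, raw_score = row
            if track_id in scores:
                raise ValueError(f"duplicate track_id: {track_id}")
            scores[track_id] = float(raw_score)
    return scores
```

Comparing the keys of the returned dictionary with the list of input files is a quick way to confirm that there is exactly one row per audio file.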
Each participant or team may submit up to four versions of their system. The final ranking will be based on the official evaluation metrics described above.
Submission Deadline: TBD
Submission Platform: TBD
Paper
The technical report submitted with each system (see Submission) should describe the system, the training data, and an analysis of results.
Top-ranked participants may be invited to present their systems in a MIREX or ISMIR-related session, depending on the final organization of the task.
Task Captains
- Yixiao Zhang & You Zhang
Additional task captains may be added for dataset licensing, evaluation infrastructure, or conflict-of-interest management.
Future Iterations
The first iteration focuses on full-song AI-generated music detection.
Future iterations may extend the task to broader AI music provenance detection, including:
- AI-generated vocals or instrumental stems
- Human music with AI-generated stem replacement
- AI-assisted remixing or arrangement
- Localized AI-generated insertions or continuations
- Human-created music reconstructed through neural codecs or vocoders
- Segment-level localization
- Generator-held-out evaluation
- Open-set detection
- Robustness to adversarial post-processing
- Calibration under deployment-like class imbalance
Bibliography
[1] MIREX 2026 Call for Challenges. Music Information Retrieval Evaluation eXchange.
[2] MIREX 2025 Song Deepfake Detection Challenge. Music Information Retrieval Evaluation eXchange.
[3] SONICS: large-scale AI-generated music dataset including Suno v3.5 and Udio. Full citation to be added.
[4] Muse: large-scale AI-generated music dataset including Suno v5. Full citation to be added.
