2026:AI-Generated Music Detection
Task Description
The MIREX 2026 AI-Generated Music Detection Task invites participants to develop systems that can detect whether a music recording contains any meaningful AI involvement.
In this task, AI involvement means that some part of the music was generated, transformed, replaced, reconstructed, or substantially resynthesized by a modern AI music or audio model. This includes fully AI-generated songs, AI-generated vocals, AI-generated instrumental stems, AI-assisted remixes, localized AI insertions, AI continuations, and music reconstructed through neural codecs or vocoders.
Participants are asked to submit systems that take a music audio recording as input and output a probability score between 0 and 1. A higher score indicates that the system believes the recording is more likely to contain AI-generated or AI-transformed musical content.
The official task is binary:
- Positive: the recording contains meaningful AI involvement.
- Negative: the recording is fully human-composed and human-performed, without AI-generated, AI-replaced, AI-transformed, or AI-reconstructed musical content.
Conventional music production tools such as DAWs, synthesizers, sample libraries, EQ, compression, reverb, pitch correction, and mastering plugins are not considered AI involvement, unless they use a generative or reconstruction model to create, replace, transform, or resynthesize musical content.
Dataset
Test Dataset
The official evaluation set will be hidden. Participants will not receive the raw evaluation audio or item-level labels.
The hidden test set will contain approximately 200 music recordings, balanced between vocal and instrumental music. It will include both AI-involved and fully human-made examples.
Positive examples may include:
- Fully AI-generated vocal songs
- Fully AI-generated instrumental music
- Human compositions with AI-generated vocals
- Human performances with AI-generated instrumental stems
- AI-assisted remixes or arrangements
- Localized AI-generated insertions or continuations
- Human-created music reconstructed through neural codecs or vocoders
Negative examples may include:
- Fully human-composed and human-performed music
- Human music with conventional production processing
- Human music with non-generative digital editing
The test set will be stratified by content type, origin, intervention level, generator family, post-processing condition, and difficulty level. Candidate generator families may include systems such as Suno, Mureka, MiniMax, Seed Music, YuE, Lyria, and other music-generation or audio-generation models, where licensing permits.
The exact composition of the hidden test set will not be disclosed before evaluation.
Training Data
No official training set is provided in the first iteration.
Participants may use public, private, synthetic, or self-constructed training data. However, no part of the hidden evaluation set may be used directly or indirectly for training, validation, model selection, prompt tuning, or threshold tuning.
Input and Output Format
Input
Submitted systems will receive:
- A directory of audio files
- A metadata CSV file
Audio files will be provided as WAV files. They may be:
- 44.1 kHz or 48 kHz
- Mono or stereo
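Because inputs may differ in sample rate and channel count, a system will usually need to inspect each file before feature extraction. The following is a minimal sketch using Python's standard `wave` module, assuming the files are uncompressed PCM WAV. The names `describe_wav` and `list_inputs` are illustrative and not part of any official interface.

```python
import wave
from pathlib import Path

# Formats named in the task description; other formats are flagged, not rejected.
EXPECTED_RATES = {44100, 48000}
EXPECTED_CHANNELS = {1, 2}


def describe_wav(path):
    """Return (sample_rate_hz, n_channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        n_frames = wav_file.getnframes()
    return rate, channels, n_frames / rate


def list_inputs(audio_dir):
    """Describe every .wav file in audio_dir, sorted by file name."""
    entries = []
    for path in sorted(Path(audio_dir).glob("*.wav")):
        rate, channels, duration = describe_wav(path)
        entries.append({
            "path": path,
            "sample_rate": rate,
            "channels": channels,
            "duration": duration,
            "as_expected": rate in EXPECTED_RATES and channels in EXPECTED_CHANNELS,
        })
    return entries
```

Whether to resample to a common rate or downmix stereo to mono is a design decision left to each system.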
Output
Each system must output one AI-involvement score for each audio file.
The score must be a scalar value in the range [0, 1], where:
- 0 means the system believes the recording is very unlikely to contain AI involvement.
- 1 means the system believes the recording is very likely to contain AI involvement.
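As an illustration of the score range, a system whose classifier emits an unbounded logit could map it into [0, 1] with a logistic sigmoid and check the result before writing it out. This is a sketch under that assumption; the task does not prescribe how scores are produced, and the helper names are hypothetical.

```python
import math


def logit_to_score(logit):
    """Map a real-valued logit into [0, 1] with a numerically stable sigmoid."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)  # safe: logit < 0, so this cannot overflow
    return exp_logit / (1.0 + exp_logit)


def is_valid_score(value):
    """True if value parses as a finite float inside the closed interval [0, 1]."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return False
    return math.isfinite(score) and 0.0 <= score <= 1.0
```

Note that NaN and infinite values fail the check, as do values slightly outside the range such as 1.2.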
Optional outputs, such as predicted intervention type or segment-level localization, may be accepted for supplementary analysis. These optional outputs will not determine the primary ranking in the first iteration.
Baselines
Baseline systems are to be determined.
Possible baseline categories may include:
- Audio classification models trained on public AI-generated music datasets
- Music or audio foundation models with a binary classifier head
- Spectrogram-based convolutional or transformer classifiers
- Systems based on audio watermark detection, where applicable
- General-purpose audio-language models prompted or fine-tuned for AI music detection
The final list of baselines will be announced before the submission phase.
Metrics
The official evaluation is binary. Each system outputs a continuous AI-involvement score for each test item.
Primary Metric
- Macro-averaged AUROC across hidden evaluation strata
  - The primary ranking metric will be macro-averaged AUROC across hidden evaluation strata. AUROC is used because different real-world applications may require different detection thresholds. Macro-averaging prevents the final ranking from being dominated by easy cases, such as fully generated songs from a small number of generator families.
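To make the primary metric concrete, the sketch below computes AUROC per stratum with the rank-sum (Mann-Whitney) formulation and then averages the per-stratum values with equal weight. It assumes each item carries exactly one stratum label and that strata lacking either class are skipped; the organizers' actual stratum definitions and handling of such cases are not specified here, so both are assumptions.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC via average ranks, so tied scores count as half-correct pairs."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Items i..j-1 are tied and share the average of ranks i+1..j.
        average_rank = (i + 1 + j) / 2.0
        rank_sum_pos += average_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


def macro_auroc(labels, scores, strata):
    """Unweighted mean of per-stratum AUROC (assumed stratum handling)."""
    groups = defaultdict(lambda: ([], []))
    for label, score, stratum in zip(labels, scores, strata):
        groups[stratum][0].append(label)
        groups[stratum][1].append(score)
    per_stratum = []
    for group_labels, group_scores in groups.values():
        if 0 < sum(group_labels) < len(group_labels):  # both classes present
            per_stratum.append(auroc(group_labels, group_scores))
    if not per_stratum:
        raise ValueError("no stratum contains both classes")
    return sum(per_stratum) / len(per_stratum)
```

Because every stratum contributes equally regardless of size, a small but difficult stratum affects the result as much as a large, easy one.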
Secondary Metrics
Secondary metrics will be reported for diagnostic analysis. These may include:
- Pooled AUROC
  - Measures overall ranking performance across all test items.
- AUPRC
  - Measures precision-recall performance, especially under class imbalance.
- Equal Error Rate
  - Reports the point where false positive rate and false negative rate are equal.
- Balanced Accuracy
  - Measures classification accuracy while accounting for class balance.
- F1 Score
  - Measures the harmonic mean of precision and recall at a selected threshold.
- False Positive Rate on Human Music
  - Measures how often fully human music is incorrectly classified as AI-involved.
- False Negative Rate by AI-Involvement Subtype
  - Measures how often different kinds of AI involvement are missed.
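Several of the threshold-dependent metrics above can be derived from a single confusion matrix. The sketch below is illustrative only: it assumes a score at or above the threshold counts as a positive prediction, uses 0.5 as an arbitrary default threshold, and approximates the Equal Error Rate by sweeping observed scores rather than interpolating the ROC curve. The official evaluation code may differ on all three points.

```python
def confusion_counts(labels, scores, threshold):
    """Count (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def threshold_metrics(labels, scores, threshold=0.5):
    """Balanced accuracy, F1, FPR and FNR at one threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative item")
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


def equal_error_rate(labels, scores):
    """Approximate EER: mean of FPR and FNR at the threshold where they are closest."""
    candidates = sorted(set(scores)) + [float("inf")]
    best_gap, best_eer = None, None
    for threshold in candidates:
        metrics = threshold_metrics(labels, scores, threshold)
        gap = abs(metrics["fpr"] - metrics["fnr"])
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (metrics["fpr"] + metrics["fnr"]) / 2
    return best_eer
```

Restricting the inputs to fully human items and one AI-involvement subtype at a time would yield the per-subtype false negative rates in the same way.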
Additional diagnostic results may include performance by vocal/instrumental category, generator-held-out condition, compression condition, excerpt length, and difficulty level.
Internal metadata will be used only for aggregate diagnostic reporting and will not be released at the item level.
Download
There is no public download for the hidden evaluation set.
The held-out audio will remain private to the organizers throughout and after the evaluation. All evaluation items will be created, licensed, commissioned, or selected under conditions that permit private evaluation by the organizers.
Participants should prepare their systems according to the input and output format described above.
Rules
- Participants may use external datasets and pre-trained models.
- Participants may use public, private, synthetic, or self-constructed data for training.
- Participants must not use any part of the hidden evaluation set for training, validation, model selection, prompt tuning, or threshold tuning.
- Participants must describe all training data, pre-trained models, external APIs, watermark detectors, and major preprocessing steps in the technical report.
- External API calls are discouraged and may be prohibited depending on MIREX execution policy, privacy requirements, and reproducibility constraints.
- The full hidden test set must be processed within a 24-hour wall-clock budget on a single GPU.
- Submissions that exceed the time budget, fail on more than 5% of test items, or output invalid scores for more than 5% of test items will be reported but excluded from the primary ranking.
- Participants must respect all relevant licenses for the data and models used in their systems.
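The exclusion rule on runtime, failures, and invalid scores can be expressed as a small check. This sketch treats "more than 5%" as a strict inequality and counts failed items and invalid scores separately, which is one reading of the rule; the function and parameter names are hypothetical.

```python
def primary_ranking_eligible(n_items, n_failed, n_invalid, runtime_hours,
                             max_hours=24.0, max_fraction=0.05):
    """Return True if a run stays within the time budget and both 5% limits."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if runtime_hours > max_hours:
        return False  # exceeded the wall-clock budget
    if n_failed / n_items > max_fraction:
        return False  # too many items on which the system failed
    if n_invalid / n_items > max_fraction:
        return False  # too many scores outside [0, 1] or unparsable
    return True
```

With roughly 200 test items, this reading allows up to 10 failed items and up to 10 invalid scores before a submission drops out of the primary ranking.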
Submission
Participants are required to submit the following:
- Docker container
  - A Docker container with a standardized inference interface. The system should take a directory of WAV files and a metadata CSV file as input, and produce a CSV file containing one AI-involvement score per item.
- Technical report
  - A 2-4 page technical report in ISMIR LBD format. The report should describe the system architecture, training data, preprocessing, inference-time input duration, use of external APIs if any, use of watermark detection if any, thresholding strategy, known limitations, and compute requirements.
- Compute declaration
  - A compute declaration reporting training data size, model size, GPU memory footprint, average inference time per track, total expected runtime, and other computational resources used in model development.
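One possible shape for the container's inference entry point is sketched below. The standardized interface has not been published, so everything specific here is an assumption: the flag names `--audio-dir`, `--metadata`, and `--output`, a `track_id` column in the metadata CSV, audio files named `<track_id>.wav`, and the placeholder `score_track` function that stands in for a real detector.

```python
import argparse
import csv
from pathlib import Path


def score_track(wav_path):
    """Placeholder scorer: a real system would run its detector on wav_path."""
    return 0.5


def run(audio_dir, metadata_csv, output_csv):
    """Score every track listed in the metadata and write one row per track."""
    audio_dir = Path(audio_dir)
    with open(metadata_csv, newline="") as f:
        track_ids = [row["track_id"] for row in csv.DictReader(f)]
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["track_id", "ai_involvement_score"])
        for track_id in track_ids:
            score = float(score_track(audio_dir / f"{track_id}.wav"))
            score = min(1.0, max(0.0, score))  # keep the score inside [0, 1]
            writer.writerow([track_id, f"{score:.3f}"])
    return len(track_ids)


def main(argv=None):
    """Parse the (assumed) command-line flags and run inference."""
    parser = argparse.ArgumentParser(description="AI-involvement scoring sketch")
    parser.add_argument("--audio-dir", required=True)
    parser.add_argument("--metadata", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    return run(args.audio_dir, args.metadata, args.output)

# A container entry point would call main() under an
# `if __name__ == "__main__":` guard.
```

Track identifiers are kept as strings throughout so that leading zeros survive the round trip from metadata to output.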
Output CSV Format
The output CSV should contain one row per audio file.
Example:
track_id,ai_involvement_score
000001,0.972
000002,0.084
000003,0.611
000004,0.238
The exact metadata format and required file naming convention will be announced before the submission phase.
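Before submitting, participants may find it useful to check their output file locally. The sketch below assumes the two column names shown in the example and treats track identifiers as strings; since the exact format has not yet been announced, the header and the `validate_output` helper are provisional.

```python
import csv
import math

EXPECTED_HEADER = ["track_id", "ai_involvement_score"]  # taken from the example


def validate_output(lines, expected_ids):
    """Return a list of problems; an empty list means the CSV looks valid."""
    rows = list(csv.reader(lines))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["missing or unexpected header row"]
    problems = []
    seen = set()
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_number}: expected 2 columns, found {len(row)}")
            continue
        track_id, raw_score = row
        if track_id in seen:
            problems.append(f"line {line_number}: duplicate track_id {track_id}")
        seen.add(track_id)
        try:
            score = float(raw_score)
        except ValueError:
            score = math.nan
        if not (math.isfinite(score) and 0.0 <= score <= 1.0):
            problems.append(f"line {line_number}: invalid score {raw_score!r}")
    missing = sorted(set(expected_ids) - seen)
    if missing:
        problems.append(f"missing track_ids: {', '.join(missing)}")
    return problems
```

For example, `validate_output(open("scores.csv", newline=""), ids)` checks a file on disk, where `scores.csv` and `ids` are placeholders for the participant's own output path and list of track identifiers.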
Each participant or team may submit up to four versions of their system. The final ranking will be based on the official evaluation metrics described above.
Submission Deadline: TBD
Submission Platform: TBD
Paper
The technical report required under Submission should describe the system, the training data, and an analysis of results.
Top-ranked participants may be invited to present their systems in a MIREX or ISMIR-related session, depending on the final organization of the task.
Task Captains
- Yixiao Zhang
  - Title: Research Scientist
  - Affiliation: ByteDance / Mureka / Independent [to be confirmed]
  - Email: [to be filled]
  - MIREX Wiki username: [to be filled]
Additional task captains may be added for dataset licensing, evaluation infrastructure, or conflict-of-interest management.