Task Description

The MIREX 2026 AI-Generated Music Detection Task invites participants to develop systems that can detect whether a music recording contains any meaningful AI involvement.

In this task, AI involvement means that some part of the music was generated, transformed, replaced, reconstructed, or substantially resynthesized by a modern AI music or audio model. This includes fully AI-generated songs, AI-generated vocals, AI-generated instrumental stems, AI-assisted remixes, localized AI insertions, AI continuations, and music reconstructed through neural codecs or vocoders.

Participants are asked to submit systems that take a music audio recording as input and output a probability score between 0 and 1. A higher score indicates that the system believes the recording is more likely to contain AI-generated or AI-transformed musical content.

The official task is binary:

Positive: the recording contains meaningful AI involvement.
Negative: the recording is fully human-composed and human-performed, without AI-generated, AI-replaced, AI-transformed, or AI-reconstructed musical content.

Conventional music production tools such as DAWs, synthesizers, sample libraries, EQ, compression, reverb, pitch correction, and mastering plugins are not considered AI involvement, unless they use a generative or reconstruction model to create, replace, transform, or resynthesize musical content.

Dataset

Test Dataset

The official evaluation set will be hidden. Participants will not receive the raw evaluation audio or item-level labels.

The hidden test set will contain approximately 200 music recordings, balanced between vocal and instrumental music. It will include both AI-involved and fully human-made examples.

Positive examples may include:

Fully AI-generated vocal songs
Fully AI-generated instrumental music
Human compositions with AI-generated vocals
Human performances with AI-generated instrumental stems
AI-assisted remixes or arrangements
Localized AI-generated insertions or continuations
Human-created music reconstructed through neural codecs or vocoders

Negative examples may include:

Fully human-composed and human-performed music
Human music with conventional production processing
Human music with non-generative digital editing

The test set will be stratified by content type, origin, intervention level, generator family, post-processing condition, and difficulty level. Candidate generator families may include systems such as Suno, Mureka, MiniMax, Seed Music, YuE, Lyria, and other music-generation or audio-generation models, where licensing permits.

The exact composition of the hidden test set will not be disclosed before evaluation.

Training Data

No official training set is provided in the first iteration.

Participants may use public, private, synthetic, or self-constructed training data. However, no part of the hidden evaluation set may be used directly or indirectly for training, validation, model selection, prompt tuning, or threshold tuning.

Input and Output Format

Input

Submitted systems will receive:

A directory of audio files
A metadata CSV file

Audio files will be provided as WAV files. They may be:

44.1 kHz or 48 kHz
Mono or stereo
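
As an illustration of handling the input variants listed above, the following minimal sketch reads a WAV file with Python's standard `wave` module and downmixes stereo to mono. The function name `load_wav_mono` is hypothetical, and the 16-bit PCM restriction is an assumption of this sketch: the task text fixes only the container, the sample rates, and the channel counts.

```python
import sys
import wave
from array import array


def load_wav_mono(path):
    """Read a 16-bit PCM WAV file and return (sample_rate, mono_samples).

    Assumption: samples are 16-bit PCM. The task description does not
    state a bit depth, so other widths are rejected rather than guessed.
    """
    with wave.open(str(path), "rb") as wf:
        sample_rate = wf.getframerate()
        n_channels = wf.getnchannels()
        sample_width = wf.getsampwidth()
        frames = wf.readframes(wf.getnframes())

    if sample_rate not in (44100, 48000):
        raise ValueError(f"unexpected sample rate: {sample_rate}")
    if n_channels not in (1, 2):
        raise ValueError(f"unexpected channel count: {n_channels}")
    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")

    pcm = array("h")
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian

    if n_channels == 2:
        # Stereo frames are interleaved: L, R, L, R, ...
        mono = [(l + r) / 2.0 / 32768.0 for l, r in zip(pcm[0::2], pcm[1::2])]
    else:
        mono = [s / 32768.0 for s in pcm]
    return sample_rate, mono
```

A real system would more likely use an audio library that also handles resampling and other sample formats; the point here is only that both sample rates and both channel layouts need a defined code path.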

Output

Each system must output one AI-involvement score for each audio file.

The score must be a scalar value in the range [0, 1], where:

0 means the system believes the recording is very unlikely to contain AI involvement.
1 means the system believes the recording is very likely to contain AI involvement.
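
For systems whose model emits an unbounded logit rather than a probability, one common way to obtain a score in [0, 1] is the logistic function. The sketch below is only an illustration; the function name `to_score` is hypothetical, and the task does not prescribe how scores are produced.

```python
import math


def to_score(logit):
    """Map an unbounded logit to [0, 1] with a numerically stable logistic."""
    if math.isnan(logit):
        raise ValueError("logit is NaN")
    if logit >= 0:
        # exp(-logit) cannot overflow for non-negative logits.
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative logits, exp(logit) cannot overflow either.
    z = math.exp(logit)
    return z / (1.0 + z)
```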

Optional outputs, such as predicted intervention type or segment-level localization, may be accepted for supplementary analysis. These optional outputs will not determine the primary ranking in the first iteration.

Baselines

Baseline systems are to be determined.

Possible baseline categories may include:

Audio classification models trained on public AI-generated music datasets
Music or audio foundation models with a binary classifier head
Spectrogram-based convolutional or transformer classifiers
Systems based on audio watermark detection, where applicable
General-purpose audio-language models prompted or fine-tuned for AI music detection

The final list of baselines will be announced before the submission phase.

Metrics

The official evaluation treats the task as binary classification: each test item has a positive or negative ground-truth label, and each system outputs a continuous AI-involvement score for that item.

Primary Metric

Macro-averaged AUROC across hidden evaluation strata: AUROC is computed separately within each hidden evaluation stratum, and the primary ranking metric is the unweighted mean of these per-stratum values. AUROC is used because it does not depend on a single decision threshold, and different real-world applications may require different detection thresholds. Macro-averaging prevents the final ranking from being dominated by easy cases, such as fully generated songs from a small number of generator families.
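
The difference between pooled and macro-averaged AUROC can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the organizers' scoring code: the function names are hypothetical, each item is assumed to belong to exactly one stratum, and each stratum is assumed to contain both positive and negative items (how single-class strata are handled is not specified in the task text).

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC from the Mann-Whitney U statistic, using average ranks for ties.

    labels are 0 (human) or 1 (AI-involved).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC is undefined without both classes")

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def macro_auroc(labels, scores, strata):
    """Unweighted mean of per-stratum AUROC values."""
    groups = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, strata):
        groups[g][0].append(y)
        groups[g][1].append(s)
    per_stratum = {g: auroc(ys, ss) for g, (ys, ss) in groups.items()}
    return sum(per_stratum.values()) / len(per_stratum), per_stratum
```

With labels `[1, 0, 1, 0]`, scores `[0.9, 0.1, 0.4, 0.6]`, and strata `["a", "a", "b", "b"]`, the pooled AUROC is 0.75, while the macro average is 0.5 because stratum `b` is ranked entirely wrong. A system cannot compensate for a failed stratum by doing well on an easy one.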

Secondary Metrics

Secondary metrics will be reported for diagnostic analysis. These may include:

Pooled AUROC: Measures overall ranking performance across all test items.

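AUPRC: Summarizes the precision-recall trade-off across all thresholds, which is informative when the classes are imbalanced.

Equal Error Rate: Reports the error rate at the threshold where the false positive rate and the false negative rate are equal.

Balanced Accuracy: Measures the mean of the true positive rate and the true negative rate at a selected threshold, so that neither class dominates the result.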

F1 Score: Measures the harmonic mean of precision and recall at a selected threshold.

False Positive Rate on Human Music: Measures how often fully human music is incorrectly classified as AI-involved.

False Negative Rate by AI-Involvement Subtype: Measures how often different kinds of AI involvement are missed.
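
The threshold-based metrics above can be sketched as follows. The function names are hypothetical, the default threshold of 0.5 is an assumption (the task text says only "a selected threshold"), and the equal error rate is approximated by sweeping the observed scores rather than interpolating the ROC curve. Both classes must be present in the input.

```python
def confusion(labels, scores, threshold):
    """Return (tp, fp, tn, fn); a score >= threshold predicts AI-involved."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_ai = s >= threshold
        if y == 1 and predicted_ai:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_ai:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def threshold_metrics(labels, scores, threshold=0.5):
    """Balanced accuracy, F1, FPR on human music, and FNR at one threshold."""
    tp, fp, tn, fn = confusion(labels, scores, threshold)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "balanced_accuracy": (tpr + (1.0 - fpr)) / 2.0,
        "f1": f1,
        "fpr": fpr,
        "fnr": 1.0 - tpr,
    }


def equal_error_rate(labels, scores):
    """Approximate EER: pick the threshold where |FPR - FNR| is smallest."""
    best_gap, best_eer = None, None
    for t in sorted(set(scores)) + [float("inf")]:
        tp, fp, tn, fn = confusion(labels, scores, t)
        fpr = fp / (fp + tn)
        fnr = fn / (fn + tp)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2.0
    return best_eer
```

The false negative rate by subtype would apply the same `fnr` computation to the positive items of one AI-involvement subtype at a time.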

Additional diagnostic results may include performance by vocal/instrumental category, generator-held-out condition, compression condition, excerpt length, and difficulty level.

Internal metadata will be used only for aggregate diagnostic reporting and will not be released at the item level.

Download

There is no public download for the hidden evaluation set.

The held-out audio will remain private to the organizers throughout and after the evaluation. All evaluation items will be created, licensed, commissioned, or selected under conditions that permit private evaluation by the organizers.

Participants should prepare their systems according to the input and output format described above.

Rules

Participants may use external datasets and pre-trained models.
Participants may use public, private, synthetic, or self-constructed data for training.
Participants must not use any part of the hidden evaluation set for training, validation, model selection, prompt tuning, or threshold tuning.
Participants must describe all training data, pre-trained models, external APIs, watermark detectors, and major preprocessing steps in the technical report.
External API calls are discouraged and may be prohibited depending on MIREX execution policy, privacy requirements, and reproducibility constraints.
The full hidden test set must be processed within a 24-hour wall-clock budget on a single GPU.
Submissions that exceed the time budget, fail on more than 5% of test items, or output invalid scores for more than 5% of test items will be reported but excluded from the primary ranking.
Participants must respect all relevant licenses for the data and models used in their systems.
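
The ranking-eligibility rule above can be checked locally before submission. The sketch below is an assumption-laden illustration: the function names are hypothetical, a "failed" item is interpreted as one with no output row, and an "invalid" score as one that is not a finite number in [0, 1]. The organizers' exact definitions are not given in the task text.

```python
import math


def is_valid_score(value):
    """True if value is a finite real number in [0, 1]."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
        and 0.0 <= value <= 1.0
    )


def ranking_eligibility(expected_ids, scores_by_id, runtime_hours,
                        max_hours=24.0, max_bad_fraction=0.05):
    """Apply the time budget and the two 5% limits from the rules."""
    n = len(expected_ids)
    missing = [t for t in expected_ids if t not in scores_by_id]
    invalid = [t for t in expected_ids
               if t in scores_by_id and not is_valid_score(scores_by_id[t])]
    failed_fraction = len(missing) / n
    invalid_fraction = len(invalid) / n
    eligible = (
        runtime_hours <= max_hours
        and failed_fraction <= max_bad_fraction   # "more than 5%" excludes
        and invalid_fraction <= max_bad_fraction
    )
    return {
        "eligible": eligible,
        "failed_fraction": failed_fraction,
        "invalid_fraction": invalid_fraction,
    }
```

For a sense of scale, a 24-hour budget spread over roughly 200 test items averages about 432 seconds per item (86,400 s / 200).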

Submission

Participants are required to submit the following:

Docker container: A Docker container with a standardized inference interface. The system should take a directory of WAV files and a metadata CSV file as input, and produce a CSV file containing one AI-involvement score per item.

Technical report: A 2-4 page technical report in ISMIR LBD format. The report should describe the system architecture, training data, preprocessing, inference-time input duration, use of external APIs if any, use of watermark detection if any, thresholding strategy, known limitations, and compute requirements.

Compute declaration: A compute declaration reporting training data size, model size, GPU memory footprint, average inference time per track, total expected runtime, and other computational resources used in model development.
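
The container's inference step could be organized around a function like the one sketched below. Everything specific here is an assumption, since the standardized interface has not been published: the function name `run_inference`, the `track_id` metadata column, the `<track_id>.wav` file naming, and the choice to skip a failing item rather than write a fallback score are all hypothetical.

```python
import csv
import math
import sys
from pathlib import Path


def run_inference(audio_dir, metadata_csv, output_csv, score_fn):
    """Score every track listed in the metadata CSV and write a scores CSV.

    score_fn is the participant's model wrapper: it takes a WAV path and
    returns a number. Returns the count of rows written.
    """
    audio_dir = Path(audio_dir)
    with open(metadata_csv, newline="") as f:
        # Assumed column name; the real metadata format is not yet announced.
        track_ids = [row["track_id"] for row in csv.DictReader(f)]

    written = 0
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["track_id", "ai_involvement_score"])
        for track_id in track_ids:
            wav_path = audio_dir / f"{track_id}.wav"  # assumed naming scheme
            try:
                score = float(score_fn(wav_path))
                if not math.isfinite(score):
                    raise ValueError("non-finite score")
            except Exception as exc:
                # One bad file should not abort the run; the item is
                # reported and left out of the output.
                print(f"skipping {track_id}: {exc}", file=sys.stderr)
                continue
            score = min(1.0, max(0.0, score))  # keep the score inside [0, 1]
            writer.writerow([track_id, f"{score:.3f}"])
            written += 1
    return written
```

Isolating per-item failures matters because a run that fails on more than 5% of test items is excluded from the primary ranking, and a single uncaught exception would otherwise lose every remaining item.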

Output CSV Format

The output CSV should contain one row per audio file.

Example:

track_id,ai_involvement_score
000001,0.972
000002,0.084
000003,0.611
000004,0.238

The exact metadata format and required file naming convention will be announced before the submission phase.
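
A submission can be sanity-checked by parsing its output in the example format above. The sketch below is an illustration with a hypothetical function name, `parse_scores`. It keeps `track_id` as a string so that leading zeros, as in `000001`, are preserved, and it rejects duplicate rows and out-of-range scores.

```python
import csv
import io

EXAMPLE = """track_id,ai_involvement_score
000001,0.972
000002,0.084
000003,0.611
000004,0.238
"""


def parse_scores(text):
    """Parse output CSV text into {track_id: score}, validating each row."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["track_id", "ai_involvement_score"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    scores = {}
    for row in reader:
        track_id = row["track_id"]  # kept as a string, never cast to int
        if track_id in scores:
            raise ValueError(f"duplicate track_id: {track_id}")
        score = float(row["ai_involvement_score"])
        if not 0.0 <= score <= 1.0:  # also rejects NaN
            raise ValueError(f"score out of range for {track_id}: {score}")
        scores[track_id] = score
    return scores
```

Calling `parse_scores(EXAMPLE)` yields a dictionary with four string keys, one per audio file.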

Each participant or team may submit up to four versions of their system. The final ranking will be based on the official evaluation metrics described above.

Submission Deadline: TBD

Submission Platform: TBD

Paper

Participants are encouraged to submit a short technical report describing their system, training data, and analysis of results.

Top-ranked participants may be invited to present their systems in a MIREX or ISMIR-related session, depending on the final organization of the task.

Task Captains

Yixiao Zhang
Title: Research Scientist
Affiliation: ByteDance / Mureka / Independent [to be confirmed]
Email: [to be filled]
MIREX Wiki username: [to be filled]

Additional task captains may be added for dataset licensing, evaluation infrastructure, or conflict-of-interest management.

Bibliography

[1] MIREX 2026 Call for Challenges. Music Information Retrieval Evaluation eXchange.

[2] MIREX 2025 Song Deepfake Detection Challenge. Music Information Retrieval Evaluation eXchange.

