Difference between revisions of "2026:AI-Generated Music Detection"

From MIREX Wiki

Revision as of 10:20, 22 June 2026

Task Description

The MIREX 2026 AI-Generated Music Detection Task invites participants to develop systems that can detect whether a music recording is fully AI-generated or human-made.

In the first iteration of this task, we focus on a simple and accessible binary setting:

  • Positive: the recording is fully AI-generated music.
  • Negative: the recording is real human-made music.

Participants are asked to submit systems that take a music audio recording as input and output a probability score between 0 and 1. A higher score indicates that the system believes the recording is more likely to be AI-generated.

This first-year task is intentionally limited to full-song AI-generated music detection. More complex cases, such as AI-generated stems, AI-assisted remixing, localized AI insertions, neural codec reconstruction, or partially AI-involved music, may be considered in future iterations.

Dataset

Hidden Test Dataset

The official evaluation set will be hidden. Participants will not receive the raw evaluation audio or item-level labels.

The hidden test set will contain both fully AI-generated music and real human-made music. The AI-generated portion will be constructed from multiple music generation systems, where licensing and evaluation conditions permit.

Candidate AI-generated sources may include:

  • Suno
  • Udio
  • Mureka
  • MiniMax
  • YuE
  • ACE-Step

The real-music negative examples will be selected from CC0 or otherwise evaluation-compatible human-made music sources.

The hidden test set will be designed to evaluate whether systems can generalize across multiple AI music generators rather than overfitting to a single source. The exact composition of the hidden test set will not be disclosed before evaluation.

Training Dataset

We plan to provide a training dataset to make the task easier to enter, especially for participants who do not have access to large-scale AI-generated music data.

Possible training sources include:

  • SONICS: approximately 96k tracks generated by Suno v3.5 and Udio.
  • Muse: approximately 116k tracks generated by Suno v5.
  • Other AI-generated or human-made music sources, to be determined.

The final training dataset will be announced before the submission phase.

Participants may also use their own public, private, synthetic, or self-constructed training data, provided that no part of the hidden evaluation set is used directly or indirectly for training, validation, model selection, prompt tuning, or threshold tuning.

Input and Output Format

Input

Submitted systems will receive:

  • A directory of audio files
  • A metadata CSV file

Audio files will be provided as WAV files. They may be:

  • 44.1 kHz or 48 kHz
  • Mono or stereo
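As an illustration of the input conditions above, the following minimal sketch reads the header of each WAV file and checks it against the stated sample rates and channel counts. It assumes plain PCM WAV files that Python's standard `wave` module can open, and it uses the file stem as a hypothetical `track_id`; the official naming convention has not been announced.

```python
import wave
from pathlib import Path

ALLOWED_RATES = {44100, 48000}   # 44.1 kHz or 48 kHz, as stated in the task
ALLOWED_CHANNELS = {1, 2}        # mono or stereo


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) read from the WAV header."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def matches_task_format(path):
    """True if the file has one of the sample rates and channel counts listed above."""
    rate, channels, _ = describe_wav(path)
    return rate in ALLOWED_RATES and channels in ALLOWED_CHANNELS


def list_inputs(audio_dir):
    """Yield (track_id, path) pairs; using the file stem as track_id is an assumption."""
    for path in sorted(Path(audio_dir).glob("*.wav")):
        yield path.stem, path
```

A system that resamples or downmixes internally can use `describe_wav` to decide which conversion is needed before inference.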

Output

Each system must output one AI-generated music score for each audio file.

The score must be a scalar value in the range [0, 1], where:

  • 0 means the system believes the recording is very unlikely to be AI-generated.
  • 1 means the system believes the recording is very likely to be AI-generated.
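One way to obtain a single score per file, sketched here under assumptions, is to score fixed-length windows and average the per-window probabilities. The task does not prescribe windowing, logits, or mean pooling; `window_logits` and `track_score` are hypothetical names used only for illustration.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function mapping a real-valued logit into [0, 1]."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def track_score(window_logits):
    """Average per-window probabilities into one track-level score in [0, 1]."""
    if not window_logits:
        raise ValueError("need at least one window to score a track")
    probabilities = [sigmoid(z) for z in window_logits]
    return sum(probabilities) / len(probabilities)
```

Because each window probability lies in [0, 1], their mean does too, so the result satisfies the required range without further clipping.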

Baselines

We plan to provide a baseline model and checkpoint to help participants get started.

The baseline system may include:

  • A standard audio classifier trained on the provided training dataset
  • A music or audio foundation model with a binary classification head
  • A reproducible inference pipeline
  • A released checkpoint
  • Example scripts for running inference and producing the required submission file

The baseline is intended as a starting point rather than a competitive upper bound. Participants are encouraged to improve upon it using better architectures, training strategies, data construction, calibration, and robustness methods.

Metrics

The official evaluation is binary. Each system outputs a continuous AI-generated music score for each test item.

Primary Metric

Macro-averaged AUROC across hidden evaluation strata
The primary ranking metric will be macro-averaged AUROC across hidden evaluation strata: AUROC is computed separately within each stratum, and the per-stratum values are averaged with equal weight. AUROC is used because different real-world applications may require different operating thresholds. Macro-averaging prevents the final ranking from being dominated by easy subsets or by one particular generator family.
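The difference between macro-averaged and pooled AUROC can be seen in a small sketch. The stratum labels here are hypothetical, since the real strata are hidden, and the sketch assumes every stratum contains at least one positive and one negative item. It is not the official scoring code.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative item")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_auroc(labels, scores, strata):
    """Compute AUROC within each stratum, then average the strata with equal weight."""
    groups = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, strata):
        groups[g][0].append(y)
        groups[g][1].append(s)
    per_stratum = {g: auroc(ys, ss) for g, (ys, ss) in groups.items()}
    return sum(per_stratum.values()) / len(per_stratum), per_stratum


# Toy data: stratum "a" is separated perfectly, stratum "b" is at chance level.
labels = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.1, 0.2, 0.6, 0.3, 0.4, 0.5]
strata = ["a", "a", "a", "a", "b", "b", "b", "b"]
macro, per_stratum = macro_auroc(labels, scores, strata)
pooled = auroc(labels, scores)
```

On this toy data the macro average is 0.75 while the pooled AUROC is 0.875, because pooling lets the easy stratum mask the weak one.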

Secondary Metrics

Secondary metrics will be reported for diagnostic analysis. These may include:

  • Pooled AUROC: measures overall ranking performance across all test items.
  • AUPRC: measures precision-recall performance, especially under class imbalance.
  • Equal Error Rate: reports the error rate at the operating point where the false positive rate and false negative rate are equal.
  • Balanced Accuracy: the mean of the true positive rate and true negative rate at a selected threshold, which accounts for class imbalance.
  • F1 Score: measures the harmonic mean of precision and recall at a selected threshold.
  • False Positive Rate on Real Human Music: measures how often human-made music is incorrectly classified as AI-generated.
  • False Negative Rate by Generator Source: measures how often AI-generated music from different generator families is missed.
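As a rough illustration of the threshold-based metrics listed above, the sketch below computes false positive rate, false negative rate, F1, and balanced accuracy at a fixed threshold, plus an approximate equal error rate. The threshold of 0.5 is an assumption, since the task does not state how the threshold will be selected, and the code assumes both classes are present.

```python
def confusion(labels, scores, threshold):
    """Count (tp, fp, tn, fn), predicting AI-generated when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def threshold_metrics(labels, scores, threshold=0.5):
    """Return FPR, FNR, F1, and balanced accuracy at one threshold."""
    tp, fp, tn, fn = confusion(labels, scores, threshold)
    fpr = fp / (fp + tn)              # human-made music flagged as AI-generated
    fnr = fn / (fn + tp)              # AI-generated music that was missed
    recall = 1.0 - fnr
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    balanced_accuracy = 0.5 * (recall + (1.0 - fpr))
    return {"fpr": fpr, "fnr": fnr, "f1": f1, "balanced_accuracy": balanced_accuracy}


def equal_error_rate(labels, scores):
    """Approximate EER: scan observed scores as thresholds and return the mean
    of FPR and FNR at the threshold where the two rates are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(scores)) + [float("inf")]:
        tp, fp, tn, fn = confusion(labels, scores, t)
        fpr = fp / (fp + tn)
        fnr = fn / (fn + tp)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fpr + fnr) / 2
    return best_rate
```

Computing the false negative rate per generator source amounts to calling `threshold_metrics` on the subset of items from one generator together with the negative items.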

Additional diagnostic results may include performance by vocal/instrumental category, generator-held-out condition, compression condition, excerpt length, and difficulty level.

Internal metadata will be used only for aggregate diagnostic reporting and will not be released at the item level.

Download

The hidden evaluation set will not be publicly released.

The held-out audio will remain private to the organizers throughout and after the evaluation. All evaluation items will be created, licensed, commissioned, or selected under conditions that permit private evaluation by the organizers.

The training dataset, baseline model, checkpoint, and example submission scripts will be released before the submission phase, subject to licensing and infrastructure constraints.

Rules

  • Participants may use the provided training dataset and baseline model.
  • Participants may use external datasets and pre-trained models.
  • Participants may use public, private, synthetic, or self-constructed data for training.
  • Participants must not use any part of the hidden evaluation set for training, validation, model selection, prompt tuning, or threshold tuning.
  • Participants must describe all training data, pre-trained models, external APIs, watermark detectors, and major preprocessing steps in the technical report.
  • External API calls are discouraged and may be prohibited depending on MIREX execution policy, privacy requirements, and reproducibility constraints.
  • The full hidden test set must be processed within a 24-hour wall-clock budget on a single GPU.
  • Submissions that exceed the time budget, fail on more than 5% of test items, or output invalid scores for more than 5% of test items will be reported but excluded from the primary ranking.
  • Participants must respect all relevant licenses for the data and models used in their systems.
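The time and failure limits in the rules above can be checked before submission with a sketch like the following. It reads "fail" as a track with no score at all and "invalid" as a score that is not a real number in [0, 1]; that reading, and every name in the code, is an assumption rather than the organizers' procedure.

```python
TIME_BUDGET_SECONDS = 24 * 60 * 60   # 24-hour wall-clock budget from the rules
MAX_BAD_PERCENT = 5                  # more than 5% failed or invalid items excludes a run


def is_valid_score(value):
    """True for a real number in [0, 1]; NaN fails the range comparison."""
    return (isinstance(value, (int, float))
            and not isinstance(value, bool)
            and 0.0 <= value <= 1.0)


def per_track_budget(n_tracks):
    """Average number of seconds available per track within the 24-hour budget."""
    return TIME_BUDGET_SECONDS / n_tracks


def ranking_eligibility(scores_by_track, expected_ids, wall_clock_seconds):
    """Apply the three exclusion conditions to one run."""
    failed = [t for t in expected_ids if t not in scores_by_track]
    invalid = [t for t in expected_ids
               if t in scores_by_track and not is_valid_score(scores_by_track[t])]
    n = len(expected_ids)
    over_time = wall_clock_seconds > TIME_BUDGET_SECONDS
    # Integer comparison avoids floating-point edge cases at exactly 5%.
    too_many_failed = 100 * len(failed) > MAX_BAD_PERCENT * n
    too_many_invalid = 100 * len(invalid) > MAX_BAD_PERCENT * n
    return {
        "eligible": not (over_time or too_many_failed or too_many_invalid),
        "failed": len(failed),
        "invalid": len(invalid),
        "over_time": over_time,
    }
```

For example, a hypothetical test set of 1,000 tracks would leave an average of 86.4 seconds per track; the real test set size has not been disclosed.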

Submission

Participants are required to submit the following:

  • Docker container: A Docker container with a standardized inference interface. The system should take a directory of WAV files and a metadata CSV file as input, and produce a CSV file containing one AI-generated music score per item.
  • Technical report: A 2-4 page technical report in ISMIR LBD format. The report should describe the system architecture, training data, preprocessing, inference-time input duration, use of external APIs if any, use of watermark detection if any, thresholding strategy, known limitations, and compute requirements.
  • Compute declaration: A compute declaration reporting training data size, model size, GPU memory footprint, average inference time per track, total expected runtime, and other computational resources used in model development.
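The standardized inference interface has not been published yet, so the following is only a sketch of what a container entry point might look like. The flag names `--audio-dir`, `--metadata-csv`, and `--output-csv` are hypothetical, `score_track` is a placeholder that returns a constant, and the metadata file is accepted but not read because its format is still to be announced. The output columns follow the example in the Output CSV Format section.

```python
import argparse
import csv
from pathlib import Path


def score_track(wav_path):
    """Placeholder model: replace with real inference. Returns a constant score."""
    return 0.5


def run(audio_dir, metadata_csv, output_csv):
    """Score every WAV file in audio_dir and write one row per scored track."""
    # metadata_csv is unused here because its format has not been announced.
    rows = []
    for wav_path in sorted(Path(audio_dir).glob("*.wav")):
        try:
            score = float(score_track(wav_path))
        except Exception:
            continue  # a skipped track will count as a failed item
        clipped = min(1.0, max(0.0, score))
        rows.append((wav_path.stem, f"{clipped:.3f}"))  # file stem as track_id (assumed)
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["track_id", "ai_generated_score"])
        writer.writerows(rows)
    return len(rows)


def main(argv=None):
    """Parse the hypothetical command-line flags and run inference."""
    parser = argparse.ArgumentParser(description="AI-generated music scoring sketch")
    parser.add_argument("--audio-dir", required=True)
    parser.add_argument("--metadata-csv", required=True)
    parser.add_argument("--output-csv", required=True)
    args = parser.parse_args(argv)
    return run(args.audio_dir, args.metadata_csv, args.output_csv)
```

A container entry point would call `main()` with the paths mounted by the evaluation harness. Catching per-track exceptions keeps one corrupt file from aborting the whole run.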

Output CSV Format

The output CSV should contain one row per audio file.

Example:

track_id,ai_generated_score
000001,0.972
000002,0.084
000003,0.611
000004,0.238
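A file in the format shown above can be sanity-checked before submission. The sketch below assumes the header is exactly `track_id,ai_generated_score` as in the example, and that each expected track appears once; `check_submission` is a hypothetical helper, not an official validator.

```python
import csv
import io

EXPECTED_HEADER = ["track_id", "ai_generated_score"]


def check_submission(csv_text, expected_ids):
    """Return a list of human-readable problems; an empty list means none were found."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
            continue
        track_id, raw_score = row
        if track_id in seen:
            problems.append(f"line {line_no}: duplicate track_id {track_id}")
        seen.add(track_id)
        try:
            score = float(raw_score)
        except ValueError:
            problems.append(f"line {line_no}: score is not a number: {raw_score!r}")
            continue
        if not 0.0 <= score <= 1.0:  # this comparison is also False for NaN
            problems.append(f"line {line_no}: score {raw_score} is outside [0, 1]")
    for missing in sorted(set(expected_ids) - seen):
        problems.append(f"missing track_id {missing}")
    return problems


example = (
    "track_id,ai_generated_score\n"
    "000001,0.972\n"
    "000002,0.084\n"
    "000003,0.611\n"
    "000004,0.238\n"
)
issues = check_submission(example, ["000001", "000002", "000003", "000004"])
```

For the example rows above, `issues` is an empty list.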

The exact metadata format and required file naming convention will be announced before the submission phase.

Each participant or team may submit up to four versions of their system. The final ranking will be based on the official evaluation metrics described above.

Submission Deadline: TBD

Submission Platform: TBD

Paper

Participants are encouraged to submit a short technical report describing their system, training data, and analysis of results.

Top-ranked participants may be invited to present their systems in a MIREX or ISMIR-related session, depending on the final organization of the task.

Task Captains

  • Yixiao Zhang & You Zhang

Additional task captains may be added for dataset licensing, evaluation infrastructure, or conflict-of-interest management.

Future Iterations

The first iteration focuses on full-song AI-generated music detection.

Future iterations may extend the task to broader AI music provenance detection, including:

  • AI-generated vocals or instrumental stems
  • Human music with AI-generated stem replacement
  • AI-assisted remixing or arrangement
  • Localized AI-generated insertions or continuations
  • Human-created music reconstructed through neural codecs or vocoders
  • Segment-level localization
  • Generator-held-out evaluation
  • Open-set detection
  • Robustness to adversarial post-processing
  • Calibration under deployment-like class imbalance

Bibliography

[1] MIREX 2026 Call for Challenges. Music Information Retrieval Evaluation eXchange.

[2] MIREX 2025 Song Deepfake Detection Challenge. Music Information Retrieval Evaluation eXchange.

[3] SONICS: large-scale AI-generated music dataset including Suno v3.5 and Udio. Full citation to be added.

[4] Muse: large-scale AI-generated music dataset including Suno v5. Full citation to be added.