2024:Singing Voice Deepfake Detection


Task Description

The WildSVDD challenge aims to detect AI-generated singing voices in real-world scenarios. The task involves distinguishing authentic human-sung songs from AI-generated deepfake songs at the clip level. Participants are required to identify whether each segmented clip contains a genuine singer or an AI-generated fake singer. The developed systems are expected to account for the complexities introduced by background music and various musical contexts. For more information about our prior work, please visit: https://main.singfake.org/

Background
With the advancement of AI technology, singing voices generated by AI are becoming increasingly indistinguishable from human performances. These synthesized voices can now emulate the vocal characteristics of any singer with minimal training data. While this technological advancement is impressive, it has sparked widespread concerns among artists, record labels, and publishing houses. The potential for unauthorized synthetic reproductions that mimic well-known singers poses a real threat to original artists' commercial value and intellectual property rights, igniting urgent calls for efficient and accurate methods to detect these deepfake singing voices.
This challenge is an extension of our previous work, SingFake [1], and was initially introduced at the 2024 IEEE Spoken Language Technology Workshop (SLT 2024) [2] with two tracks: CtrSVDD and WildSVDD. The CtrSVDD track [3] garnered significant attention from the speech community. With WildSVDD, we aim to raise awareness within the ISMIR community and draw on the expertise of music researchers.
[1] Zang, Yongyi, You Zhang, Mojtaba Heydari, and Zhiyao Duan. "SingFake: Singing voice deepfake detection." In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12156-12160. IEEE, 2024. https://ieeexplore.ieee.org/document/10448184
[2] Zhang, You, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Tomoki Toda, and Zhiyao Duan. "SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge." In Proc. IEEE Spoken Language Technology Workshop (SLT), 2024. https://arxiv.org/abs/2408.16132
[3] Zang, Yongyi, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu et al. "CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection." In Proc. Interspeech, pp. 4783-4787, 2024. https://doi.org/10.21437/Interspeech.2024-2242

Dataset

Description
The WildSVDD dataset extends the SingFake dataset with a more diverse and comprehensive collection of real and AI-generated singing voice clips. We gathered data annotations from social media platforms. Annotators familiar with the singers they covered manually verified the user-specified labels to ensure accuracy, especially in cases where the purported singer(s) did not actually perform a song. We then cross-checked the annotations against song titles and descriptions and manually reviewed any discrepancies.
Description of Audio Files
The audio files in the WildSVDD dataset represent a broad range of languages and singers. These clips include strong background music, simulating real-world conditions that challenge the distinction between real and AI-generated voices. The dataset ensures diversity in the source material, with varying levels of complexity in the musical contexts.
Description of Split
The dataset is divided into training and evaluation subsets. Test Set A includes new samples, while Test Set B represents the most challenging subset of the SingFake dataset. Participants may carve validation sets out of the training data (one possible scheme is sketched below) but must adhere to the restrictions on using the evaluation data.
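Since in-the-wild test singers are typically unseen during training, one reasonable (unofficial) way to build a validation set is to hold out entire singers rather than random clips, so that validation singers never appear in training. The sketch below assumes annotations are available as records with a "singer" field, which is a hypothetical structure; adapt it to the released annotation format.

    import random
    from collections import defaultdict

    def singer_disjoint_split(annotations, val_fraction=0.1, seed=0):
        # annotations: list of dicts, each with a "singer" key
        # (hypothetical structure; adapt to the released annotation format)
        by_singer = defaultdict(list)
        for item in annotations:
            by_singer[item["singer"]].append(item)
        singers = sorted(by_singer)
        random.Random(seed).shuffle(singers)
        n_val = max(1, int(len(singers) * val_fraction))
        val = [x for s in singers[:n_val] for x in by_singer[s]]
        train = [x for s in singers[n_val:] for x in by_singer[s]]
        return train, val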

Baseline

Model Architecture
Participants are referred to the baseline systems from the SingFake [1] and SingGraph [2] projects. SingGraph incorporates state-of-the-art components for detecting AI-generated singing voices, including graph-based modeling. Key features of these baselines are robust handling of background music and adaptation to different musical styles (a vocal-separation sketch follows the references below). Results of the SingFake baseline systems on the WildSVDD test data are reported in our SVDD@SLT challenge overview paper [3].
[1] SingFake: https://github.com/yongyizang/SingFake
[2] SingGraph: https://github.com/xjchenGit/SingGraph
[3] SVDD 2024@SLT: https://arxiv.org/abs/2408.16132
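Because background music is a central difficulty of this task, a common preprocessing step in such pipelines is to separate the vocal stem before classification. A minimal sketch using the Demucs command-line tool on a local file song.wav (the exact separation setup may differ from what the baseline repositories use):

    import subprocess

    # Separate vocals from accompaniment with Demucs (two-stem mode).
    # The separated vocals land under separated/<model_name>/song/vocals.wav.
    subprocess.run(
        ["demucs", "--two-stems", "vocals", "-o", "separated", "song.wav"],
        check=True,
    )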

Metrics

The primary evaluation metric is the Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. Unlike accuracy, EER does not depend on a fixed decision threshold, so it provides a more reliable assessment of how well a system separates bonafide from deepfake singing voices. A lower EER indicates better discrimination between real and AI-generated voices.
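For reference, EER can be computed from a system's scores in a few lines. The sketch below assumes labels of 1 for bonafide and 0 for deepfake, with higher scores meaning "more likely bonafide"; the official scoring script may adopt a different convention.

    import numpy as np
    from sklearn.metrics import roc_curve

    def compute_eer(labels, scores):
        # labels: 1 = bonafide, 0 = deepfake (assumed convention)
        # scores: higher values indicate bonafide
        fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
        fnr = 1 - tpr
        # EER is the point where the false positive rate equals the
        # false negative rate; take the closest ROC point to it.
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2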

Download

The dataset and necessary resources can be accessed via the following links:

Participants are encouraged to use the provided tools to download and segment song clips to ensure consistency in evaluation.
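The general workflow those tools implement looks roughly like the sketch below, which downloads a song's audio with yt-dlp and cuts annotated segments with ffmpeg. The function name, URL handling, and segment format here are illustrative placeholders, not the official tooling:

    import subprocess
    from pathlib import Path

    def download_and_segment(url, segments, out_dir="clips"):
        # url: song URL from the annotations
        # segments: list of (start_sec, end_sec) pairs, one per clip
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        # Download the audio track and convert it to WAV.
        subprocess.run(["yt-dlp", "-x", "--audio-format", "wav",
                        "-o", str(out / "full.%(ext)s"), url], check=True)
        song = out / "full.wav"
        # Cut each annotated segment into its own clip.
        for i, (start, end) in enumerate(segments):
            clip = out / f"clip_{i:03d}.wav"
            subprocess.run(["ffmpeg", "-y", "-i", str(song),
                            "-ss", str(start), "-to", str(end),
                            str(clip)], check=True)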

Rules

Participants are allowed to use any publicly available datasets for training, excluding those used in the test set. Any additional data sources or pre-trained models must be clearly documented in the system descriptions. Private data or models are strictly prohibited to maintain fairness. All submissions should focus on segment-level evaluation, with results presented in a score file format.

Submission

  • Submission Deadline: October 15, Anywhere on Earth (AoE)
Results submission
Participants should submit a score file in TXT format that includes, for each segment, the URL, the segment's start and end timestamps, and a score indicating the system's confidence that the clip is bonafide or deepfake. Submissions will be evaluated by EER and ranked accordingly.
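As an illustration only (check the official instructions for the exact column order and score convention), each line of the score file might look like:

    # URL  start(s)  end(s)  score
    https://example.com/watch?v=abc123 12.40 18.75 0.93
    https://example.com/watch?v=abc123 18.75 25.10 -0.37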
System description submission
Participants are required to describe their systems, including data preprocessing, model architecture, training details, and post-processing.
Research paper submission
Participants are encouraged to submit a research paper to the MIREX track at ISMIR 2024.
Workshop presentation
We will invite top-ranked participants to present their work during the workshop session. The format will be hybrid to accommodate remote participation.