2024: Singing Voice Deepfake Detection
Task Description
The WildSVDD challenge focuses on detecting AI-generated singing voices in the wild. As generation technology advances, AI-generated singing voices are becoming increasingly indistinguishable from human performances. This task challenges participants to develop systems that accurately distinguish real singing voices from AI-generated ones, especially amid background music and diverse musical environments. Participants will leverage the WildSVDD dataset, which includes a wide variety of bonafide and deepfake song clips, to develop and evaluate their systems.
Dataset
- Description
- The WildSVDD dataset extends the SingFake dataset into a more diverse and comprehensive collection of real and AI-generated singing voice clips. It comprises 97 singers with 2,007 deepfake and 1,216 bonafide song clips, with annotations checked for accuracy.
- Description of Audio Files
- The audio files in the WildSVDD dataset span a broad range of languages and singers. The clips retain strong background music, simulating real-world conditions that make real and AI-generated voices harder to tell apart. The dataset ensures diversity in the source material, with varying levels of complexity in the musical contexts.
- Description of Split
- The dataset is divided into training and evaluation subsets. Test Set A consists of newly collected samples, while Test Set B comprises the most challenging subset of the SingFake dataset. Participants may carve validation sets out of the training data (one singer-disjoint approach is sketched below) but must adhere to the restrictions on using the evaluation data.
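Because the same singer's voice appearing in both training and validation can inflate validation performance, one reasonable (but not prescribed) way to build a validation set is to hold out entire singers. The sketch below assumes a simple list of (singer_id, clip_path) tuples; the field names are illustrative, not the dataset's actual metadata schema.

```python
import random
from collections import defaultdict

def singer_disjoint_split(clips, val_fraction=0.2, seed=42):
    """Split (singer_id, clip_path) pairs so that no singer appears in
    both training and validation -- one way to avoid identity leakage.
    """
    by_singer = defaultdict(list)
    for singer, path in clips:
        by_singer[singer].append(path)
    singers = sorted(by_singer)
    random.Random(seed).shuffle(singers)
    n_val = max(1, int(len(singers) * val_fraction))
    val_singers = set(singers[:n_val])
    train = [(s, p) for s, p in clips if s not in val_singers]
    val = [(s, p) for s, p in clips if s in val_singers]
    return train, val
```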
Baseline
- Model Architecture
- Participants are referred to the baseline systems from the SingFake [1] and SingGraph [2] projects. These baselines include state-of-the-art components for detecting AI-generated singing voices, incorporating advanced techniques such as graph-based modeling and controlled SVDD analysis. Their key features include robust handling of background music and adaptation to different musical styles.
[1] SingFake: https://github.com/yongyizang/SingFake
[2] SingGraph: https://github.com/xjchenGit/SingGraph
Metrics
The primary evaluation metric is Equal Error Rate (EER), the operating point at which the false acceptance rate (deepfakes accepted as bonafide) equals the false rejection rate (bonafide rejected as deepfake). EER is preferred over accuracy because it does not depend on a fixed decision threshold, providing a more reliable assessment of system performance. A lower EER indicates better separation between real and AI-generated voices.
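For concreteness, here is a minimal NumPy sketch of one common way to estimate EER from segment scores. It assumes higher scores mean "more likely bonafide" (the score polarity is an assumption here) and is not the challenge's official scoring script.

```python
import numpy as np

def compute_eer(bonafide_scores, deepfake_scores):
    """Equal Error Rate: the operating point where the false acceptance
    rate (deepfakes accepted as bonafide) equals the false rejection
    rate (bonafide rejected as deepfake).

    Assumes higher scores mean "more likely bonafide"; this polarity is
    an assumption, not the challenge's official convention.
    """
    all_scores = np.concatenate([bonafide_scores, deepfake_scores])
    thresholds = np.sort(np.unique(all_scores))
    # Sweep every observed score as a candidate decision threshold.
    far = np.array([np.mean(deepfake_scores >= t) for t in thresholds])
    frr = np.array([np.mean(bonafide_scores < t) for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # closest crossing of the two curves
    return (far[idx] + frr[idx]) / 2.0

# Toy check: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
bona = rng.normal(1.0, 0.5, 1000)    # scores for bonafide clips
fake = rng.normal(-1.0, 0.5, 1000)   # scores for deepfake clips
print(f"EER: {compute_eer(bona, fake):.4f}")
```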
Download
The dataset and necessary resources can be accessed via the following links:
- Dataset download: [Zenodo WildSVDD](https://zenodo.org/records/10893604)
- Download tools: https://pastebin.com/YhpYXT9z, https://cobalt.tools/, https://github.com/ytdl-org/youtube-dl, https://github.com/yt-dlp/yt-dlp, https://www.locoloader.com/bilibili-video-downloader/
- Segmentation tool: [SingFake GitHub](https://github.com/yongyizang/SingFake/tree/main/dataset)
Participants are encouraged to use the provided tools to download and segment song clips to ensure consistency in evaluation.
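As an illustration, a clip's audio can be fetched with yt-dlp (one of the tools listed above) and decoded to WAV before running the SingFake segmentation tool. The URL below is a placeholder; use the annotated URLs that ship with the dataset.

```python
import subprocess

def download_audio(url, out_dir="clips"):
    """Fetch a song clip's audio track with yt-dlp and extract it to
    WAV for downstream segmentation.
    """
    subprocess.run(
        [
            "yt-dlp",
            "-x",                                # extract audio only
            "--audio-format", "wav",             # decode to WAV
            "-o", f"{out_dir}/%(id)s.%(ext)s",   # name files by video ID
            url,
        ],
        check=True,
    )

# Placeholder URL; substitute the dataset's annotated URLs.
download_audio("https://www.youtube.com/watch?v=EXAMPLE")
```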
Rules
Participants are allowed to use any publicly available datasets for training, excluding those used in the test set. Any additional data sources or pre-trained models must be clearly documented in the system descriptions. Private data or models are strictly prohibited to maintain fairness. All submissions should focus on segment-level evaluation, with results presented in a score file format.
Submission
- Results submission
Participants should submit a score file in TXT format containing, for each segment, the URL, the segment's start and end timestamps, and a score indicating the system's confidence that the segment is bonafide or deepfake (a hypothetical formatting sketch follows). Submissions will be evaluated by EER and ranked accordingly.
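As a rough illustration of what such a score file might look like, the sketch below writes one whitespace-delimited line per segment. The column order, delimiter, and score polarity are assumptions for illustration only; defer to the official submission instructions.

```python
def write_score_file(entries, path="scores.txt"):
    """Write one line per segment: URL, start time, end time, score.

    The exact column order and delimiter here are assumptions, not the
    challenge's official format.
    """
    with open(path, "w") as f:
        for url, start, end, score in entries:
            # Assumed polarity: higher score = more likely bonafide.
            f.write(f"{url} {start:.2f} {end:.2f} {score:.6f}\n")

# Hypothetical example entry.
write_score_file([
    ("https://www.youtube.com/watch?v=EXAMPLE", 12.50, 18.75, 0.913244),
])
```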
- Paper submission