2021:Lyrics Transcription (formerly: Automatic Lyrics-to-Audio Alignment)
Description
This year we host the Automatic Lyrics Transcription challenge. You are free to participate in either of its subtasks (monophonic or polyphonic input) or in both. The task of lyrics transcription aims to identify the words from sung music audio, in the same way as automatic speech recognition does for speech.
The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Ideally, the lyrics transcriber should output meaningful word sequences.
Training Datasets
Datasets within automatic lyrics transcription research can be categorised under two domains: monophonic and polyphonic recordings. The former contains only one singer singing the lyrics; the latter additionally contains musical accompaniment. For this challenge, we recommend using the open-source datasets below:
DAMP dataset
The DAMP - Sing!300x30x2 dataset consists of solo singing recordings (monophonic) performed by amateur singers, collected via a mobile karaoke application. The data is curated to be gender-balanced and contains performers from 30 different countries, which introduces a good amount of variability in terms of accents and pronunciation. A list of recordings is available. For more details, see the paper.
- The audio can be downloaded from the Smule web site
- Lyrics boundary annotations can be generated from raw annotations using this repository.
- Alternatively, the annotations can be retrieved directly in Kaldi format here
DALI Dataset
DALI (a large Dataset of synchronised Audio, LyrIcs and notes) is the benchmark dataset for building an acoustic model on polyphonic recordings; it contains over 5000 songs with semi-automatically aligned lyrics annotations. The songs are full-length commercial recordings, and the lyrics are annotated at different levels of granularity, including words and notes (and the syllables underlying a given note). For each song, DALI provides a link to a matched YouTube video from which the audio can be retrieved.
- For more details, see its full description here.
Evaluation Datasets
The following datasets are used for evaluation and therefore cannot be used by participants to train their models under any circumstance. Note that the evaluation sets listed below consist of popular songs in the English language and have overlapping samples with DALI. If you use DALI for training, you MUST exclude the overlapping songs from your training set for a scientifically valid evaluation.
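As a rough illustration of such filtering, the sketch below drops DALI entries whose normalised artist/title pair appears in an exclusion list. The file name evaluation_titles.txt, its tab-separated layout, and the artist/title fields are hypothetical; adapt them to however you store the DALI metadata:

# Hypothetical sketch: exclude evaluation songs from a DALI-based training set.
def normalise(s):
    # Lower-case and strip non-alphanumeric characters so that minor
    # formatting differences do not prevent a match.
    return "".join(c.lower() for c in s if c.isalnum())

def filter_overlap(dali_entries, blocklist_path="evaluation_titles.txt"):
    # blocklist_path is assumed to hold one "artist<TAB>title" pair per line.
    with open(blocklist_path, encoding="utf-8") as f:
        blocked = {tuple(normalise(x) for x in line.rstrip("\n").split("\t")) for line in f}
    return [e for e in dali_entries
            if (normalise(e["artist"]), normalise(e["title"])) not in blocked]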
Hansen's Dataset
The dataset contains 9 pop songs released in the early 2010s.
The audio comes in two versions: the original mix with instrumental accompaniment and an a cappella (singing voice only) version. An example song can be seen here.
You can read in detail about how the dataset was made here: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients. The recordings have been provided by Jens Kofod Hansen for public evaluation.
- file duration up to 4:40 minutes (total time: 35:33 minutes)
- 3590 words annotated in total
Mauch's Dataset
The dataset contains 20 pop songs with annotations of the beginning timestamp of each word. The audio has instrumental accompaniment. An example song can be seen here.
You can read in detail about how the dataset was used for the first time here: Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment. The dataset has been kindly provided by Sungkyun Chang.
- file duration up to 5:40 minutes (total time: 1h 19m)
- 5050 words annotated in total
Jamendo Dataset
This dataset contains 20 recordings of varying Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment. It is available online on GitHub. Note, however, that tuning model parameters on this data is not allowed; it may only be used to gain insight into the general structure of the test data. For more information, also refer to this paper.
- file duration up to 4:43 (total time: 1h 12m)
- 5677 words annotated in total
Evaluation
Transcription
Word Error Rate (WER) : the standard metric used in automatic speech recognition, i.e. the number of word-level substitutions, deletions and insertions needed to turn the predicted word sequence into the reference, divided by the number of reference words.
Character Error Rate (CER) : the same computation performed at the character level. This metric penalises partially correct or misspelled words less than WER does.
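For reference, both metrics can be computed with a standard Levenshtein (edit-distance) routine. The following is a minimal sketch, assuming the third-party editdistance package (any edit-distance implementation works equally well); it is illustrative only and not the official scoring code:

import editdistance  # assumption: the editdistance package is installed (pip install editdistance)

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words,
    # computed here as the edit distance between the two word sequences.
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return editdistance.eval(ref_words, hyp_words) / len(ref_words)

def character_error_rate(reference, hypothesis):
    # CER: the same edit distance, but over characters instead of words.
    return editdistance.eval(reference, hypothesis) / len(reference)

print(word_error_rate("you are my sunshine", "you are my sun shine"))       # 0.50
print(character_error_rate("you are my sunshine", "you are my sun shine"))  # ~0.05

The toy example shows why CER is more forgiving: splitting "sunshine" into "sun shine" costs two word errors but only one character insertion.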
Submission Format
Submissions should be packaged and contain at least two files: The algorithm itself (as a binary or source code) and a README containing contact information and detailing, in full, the use of the algorithm.
Input Data
Participating algorithms will have to receive the following input format:
- Audio format : WAV / MP3
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono) for the a cappella versions (Hansen) and two channels (stereo) for the original mixes
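For example, a submission might standardise its input by loading the audio at 44.1 kHz and downmixing to mono. The sketch below uses librosa purely as an illustration; the challenge does not mandate any particular audio library:

import librosa  # assumption: librosa (with an MP3-capable backend) is available

def load_audio(path, sr=44100, mono=True):
    # Reads WAV or MP3, resamples to the requested rate and, when mono=True,
    # averages the channels of the original mixes down to a single channel.
    audio, sr = librosa.load(path, sr=sr, mono=mono)
    return audio, sr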
Output File Format
A text file (per song) containing the list of words separated by whitespace:
<word1> <word2> ... <wordN>
Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.
This file should ideally be located at ${output}/${input_song_id}.txt.
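A minimal sketch of writing a conforming output file is given below; the set of non-word tokens is illustrative only (use whatever markers your decoder actually emits):

import os

# Illustrative non-word tokens; replace with the markers your system produces.
NON_WORD_TOKENS = {"<sil>", "<music>", "<noise>", "<unk>", "</s>"}

def write_transcription(words, output_dir, input_song_id):
    # Drop non-word items and write the words on one line, separated by spaces.
    words = [w for w in words if w not in NON_WORD_TOKENS]
    out_path = os.path.join(output_dir, input_song_id + ".txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(" ".join(words) + "\n")
    return out_path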
Command line calling format
The submitted algorithm must take as arguments the input audio file as well as the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input audio file path and name as %input_audio and the output file path and name as %output, a program called foobar could be called from the command line as follows:
foobar ${input_audio_path} ${output}
OR with flags:
foobar -i ${input_audio_path} -o ${output}
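For illustration, a submission's entry point might look like the sketch below; the transcribe() function is a placeholder for the participant's own system, and only the flag names above are taken from this page:

#!/usr/bin/env python3
# Hypothetical entry-point skeleton for a submission called "foobar".
import argparse

def transcribe(audio_path):
    # Placeholder: the participant's own transcription system goes here.
    raise NotImplementedError

def main():
    parser = argparse.ArgumentParser(description="Lyrics transcription submission")
    parser.add_argument("-i", "--input_audio", required=True, help="input .wav/.mp3 file")
    parser.add_argument("-o", "--output", required=True, help="full path and name of the output .txt file")
    args = parser.parse_args()

    words = transcribe(args.input_audio)
    with open(args.output, "w", encoding="utf-8") as f:
        f.write(" ".join(words) + "\n")

if __name__ == "__main__":
    main()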
README File
A README file accompanying each submission should contain clear instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.
Submission closing dates
Closing date: December 9, 2021
Questions?
- send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)
Potential Participants
Chitralekha Gupta
Emir Demirel
Gerardo Roa Dabike
Bibliography
Stoller, D., Durand, S., & Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. ICASSP 2019.
Sharma, B., & Gupta, C. (2019). Automatic Lyrics-to-audio Alignment on Polyphonic Music Using Singing-adapted Acoustic Models. ICASSP 2019.
Lee, S. W., & Scott, J. (2017). Word-level Lyrics-Audio Synchronization Using Separated Vocals. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 646-650.
Chang, S., & Lee, K. (2017). Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics. arXiv preprint arXiv:1701.06078.
Pons, J., Gong, R., & Serra, X. (2017). Score-informed Syllable Segmentation for A Cappella Singing Voice with Convolutional Neural Networks. ISMIR 2017.
Kruspe, A. (2016). Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing. ISMIR 2016.
Dzhambazov, G., & Serra, X. (2015). Modeling of Phoneme Durations for Alignment Between Polyphonic Audio and Lyrics. 12th Sound and Music Computing Conference.
Fujihara, H., & Goto, M. (2012). Lyrics-to-Audio Alignment and Its Application. Dagstuhl Follow-Ups, Vol. 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
Mauch, M., Fujihara, H., & Goto, M. (2012). Integrating Additional Chord Information into HMM-Based Lyrics-to-Audio Alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.
Fujihara, H., Goto, M., Ogata, J., & Okuno, H. G. (2011). LyricSynchronizer: Automatic Synchronization System Between Musical Audio Signals and Lyrics. IEEE Journal of Selected Topics in Signal Processing.
Mesaros, A., & Virtanen, T. (2008). Automatic Alignment of Music Audio and Lyrics. Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland.