2019:Automatic Lyrics-to-Audio Alignment
Description
The goal of automatic lyrics-to-audio alignment is to synchronize an audio recording of singing with its corresponding written lyrics. The beginning timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, or phrases. This task requires word-level alignment.
 -----------------------     ---------------------------------------------------
 | Mixed singing audio |     | Lyrics at word-level: no more carefree ... ...  |
 -----------------------     ---------------------------------------------------
            |                                         |
            -------------------------------------------
                                 |
                       --------------------
                       | Alignment system |
                       --------------------
                                 |
                     --------------------------
                     | 0.123  0.798  no       |
                     | 0.798  1.123  more     |
                     | 1.345  2.176  carefree |
                     | ...    ...             |
                     --------------------------
The algorithm receives two inputs: mixed singing audio (singing voice plus musical accompaniment) and its corresponding lyrics at word level. It outputs the onset and offset timestamps (in seconds) of each word.
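The expected output can be illustrated with a minimal sketch that writes and reads one `onset offset word` triple per line. The function names, the space delimiter, and the three-decimal precision are assumptions made for illustration; the task description only fixes the content of each line (onset and offset in seconds, followed by the word).

```python
from typing import List, Tuple

# (onset in seconds, offset in seconds, word)
WordTiming = Tuple[float, float, str]


def format_alignment(timings: List[WordTiming]) -> str:
    """Serialize word timings, one 'onset offset word' triple per line."""
    lines = []
    for onset, offset, word in timings:
        if offset < onset:
            raise ValueError(f"offset before onset for word {word!r}")
        # Assumed layout: space-separated fields, millisecond precision.
        lines.append(f"{onset:.3f} {offset:.3f} {word}")
    return "\n".join(lines)


def parse_alignment(text: str) -> List[WordTiming]:
    """Parse lines of 'onset offset word' back into tuples."""
    timings = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        onset, offset, word = line.split()
        timings.append((float(onset), float(offset), word))
    return timings


# The example values are taken from the diagram in the task description.
example = [(0.123, 0.798, "no"), (0.798, 1.123, "more"), (1.345, 2.176, "carefree")]
text = format_alignment(example)
```

Calling `parse_alignment(text)` returns the same triples that were written, which makes the pair convenient for checking a system's output file before submission.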
Datasets
Training Dataset
The DAMP dataset contains a large number (34 000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app in different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, with 20 performances of each of 300 songs was created by Kruspe (2016). Here is the list of recordings.
- The audio can be downloaded from the Smule web site
- No lyrics boundary annotations are available, but the textual lyrics can be found on the Smule Sing! Karaoke website
Evaluation Datasets
Hansen's Dataset
The dataset contains 9 pop music songs in English with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (they are copies of the next word's beginning timestamp) and are not used in the evaluation. Sentence-level annotations are also provided. The audio comes in two versions: the original with instrumental accompaniment, and an a cappella version with the singing voice only. An example song can be seen here.
You can read in detail about how the dataset was made here: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients. The dataset has been kindly provided by Jens Kofod Hansen.
- file duration up to 4:40 minutes (total time: 35:33 minutes)
- 3590 words annotated in total
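Since the ending timestamps in this dataset are copies of the next word's beginning timestamp, they can be derived from the beginnings alone. The sketch below shows that derivation; the function name and the `final_end` parameter (the end time assigned to the last word, for example the audio duration) are hypothetical, as the description does not say how the last word's ending is set.

```python
from typing import List, Tuple


def add_ending_timestamps(
    onsets: List[float], words: List[str], final_end: float
) -> List[Tuple[float, float, str]]:
    """Pair each word onset with the next word's onset as its ending.

    final_end is an assumed value for the last word's ending,
    which has no following word to copy from.
    """
    if len(onsets) != len(words):
        raise ValueError("onsets and words must have the same length")
    # Ending of word i is the beginning of word i + 1.
    ends = onsets[1:] + [final_end]
    return list(zip(onsets, ends, words))
```

For example, `add_ending_timestamps([0.123, 0.798, 1.345], ["no", "more", "carefree"], 2.176)` assigns 0.798 as the ending of "no" and 1.345 as the ending of "more".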
Mauch's Dataset
The dataset contains 20 pop music songs in English with annotations of the beginning timestamp of each word. Non-vocal sections are not explicitly annotated (they remain included in the duration of the preceding word). We prefer to leave it this way, to enable comparison with previous work evaluated on this dataset. The audio has instrumental accompaniment. An example song can be seen here.
You can read in detail about how the dataset was used for the first time here: Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment. The dataset has been kindly provided by Sungkyun Chang.
- file duration up to 5:40 minutes (total time: 1:19:12 hours)
- 5050 words annotated in total
Gracenote Dataset
The dataset contains 15 pop music song excerpts in English with annotations of the beginning timestamp of each word. 8 song excerpts have instrumental accompaniment only. The other 7 song excerpts have two versions: one with instrumental accompaniment and one with a cappella singing.
- file duration up to 1:11 minutes (total time: 11:42 minutes)
- 1181 words annotated in total
Phonetization
A popular choice for phonetization of the words is the CMU pronunciation dictionary. Words can be phonetized with the online tool. A list of all words of both datasets that are not in the CMU word list is given here.
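As a rough illustration of dictionary-based phonetization, the sketch below parses entries in the general layout of the CMU dictionary (a word followed by its phonemes, `;;;` comment lines, and `(n)` suffixes for alternative pronunciations) and reports lyrics words that have no entry. The sample entries are illustrative and not copied from a verified dictionary release, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Illustrative entries only, written in the dictionary's general layout.
SAMPLE_DICT = """\
;;; comment lines start with three semicolons
NO  N OW1
MORE  M AO1 R
A  AH0
A(1)  EY1
"""

# Matches the "(1)", "(2)", ... suffix that marks alternative pronunciations.
_VARIANT = re.compile(r"\(\d+\)$")


def load_pronunciations(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each lowercased word to a list of phoneme sequences."""
    pronunciations: Dict[str, List[List[str]]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blanks and comments
        head, *phones = line.split()
        word = _VARIANT.sub("", head).lower()
        pronunciations.setdefault(word, []).append(phones)
    return pronunciations


def out_of_vocabulary(
    lyrics_words: Iterable[str], pronunciations: Dict[str, List[List[str]]]
) -> List[str]:
    """Return lyrics words without a dictionary entry, in first-seen order."""
    missing: List[str] = []
    seen = set()
    for word in lyrics_words:
        key = word.lower()
        if key not in pronunciations and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing


pron = load_pronunciations(SAMPLE_DICT.splitlines())
```

With the sample entries above, `out_of_vocabulary(["No", "more", "carefree"], pron)` flags only "carefree", because the sample text deliberately omits it. Words reported this way need a pronunciation from another source.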
Audio Format
The data are audio files in wav/mp3 format, plus the associated word boundaries (in csv-like .txt/.tsv files)
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono) for a cappella and two channels for original
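The format constraints listed above can be checked for wav files with Python's standard `wave` module, as in the following sketch. The function names are hypothetical, and mp3 files are not covered because `wave` reads only uncompressed wav data.

```python
import io
import wave
from typing import List


def check_wav_format(source, expected_channels: int) -> List[str]:
    """Return a list of deviations from 16-bit, 44100 Hz PCM.

    source is a file path or a binary file-like object;
    expected_channels is 1 for a cappella and 2 for original audio.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bit
            problems.append(f"sample width is {8 * wav.getsampwidth()} bit")
        if wav.getframerate() != 44100:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != expected_channels:
            problems.append(f"channel count is {wav.getnchannels()}")
    return problems


def make_silent_wav(channels: int, rate: int = 44100, frames: int = 100) -> io.BytesIO:
    """Build a short silent 16-bit wav in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (frames * channels * 2))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer
```

An empty result from `check_wav_format(make_silent_wav(1), expected_channels=1)` means the file matches the mono a cappella format; a stereo file checked against `expected_channels=1` yields one reported problem.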