Difference between revisions of "2019:Automatic Lyrics-to-Audio Alignment"
Revision as of 08:38, 5 July 2019
Description
The goal of automatic lyrics-to-audio alignment is to synchronize an audio recording of singing with its corresponding written lyrics. The beginning timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, or phrases. For this task, word-level alignment is required.
-----------------------    --------------------------------------------------
| Mixed singing audio |    | Lyrics at word-level: no more carefree ... ... |
-----------------------    --------------------------------------------------
           |                                        |
           ------------------------------------------
                               |
                      --------------------
                      | Alignment system |
                      --------------------
                               |
                               |
                   --------------------------
                   | 0.123  0.798  no       |
                   | 0.798  1.123  more     |
                   | 1.345  2.176  carefree |
                   | ...    ...             |
                   --------------------------
The algorithm receives two inputs: mixed singing audio (singing voice plus musical accompaniment) and the corresponding word-level lyrics. It outputs the onset and offset timestamps (in seconds) of each word.
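As a rough illustration of the output described above, the following Python sketch parses lines of the form "onset offset word". The whitespace-separated layout is an assumption based on the diagram; the exact file format expected by the evaluation is not specified here, and the names `WordAlignment` and `parse_alignment` are hypothetical.

```python
from typing import List, NamedTuple


class WordAlignment(NamedTuple):
    onset: float   # word start time in seconds
    offset: float  # word end time in seconds
    word: str


def parse_alignment(text: str) -> List[WordAlignment]:
    """Parse lines of the assumed form '<onset> <offset> <word>'."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and lines without exactly three fields
        onset, offset, word = float(parts[0]), float(parts[1]), parts[2]
        if offset < onset:
            raise ValueError(f"offset before onset for word {word!r}")
        rows.append(WordAlignment(onset, offset, word))
    return rows


# The three example rows from the diagram.
example = """\
0.123 0.798 no
0.798 1.123 more
1.345 2.176 carefree
"""
alignment = parse_alignment(example)
for row in alignment:
    print(f"{row.word}: {row.onset:.3f}s to {row.offset:.3f}s")
```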
Datasets
Training Dataset
The DAMP dataset contains a large number (34 000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app in different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, with 20 performances of each of the 300 songs was created by Kruspe (2016). Here is the list of recordings.
- The audio can be downloaded from the Smule web site.
- No lyrics boundary annotations are available, but the textual lyrics can be found on the Smule Sing! Karaoke website.
Evaluation Datasets
Hansen's Dataset
The dataset contains 9 pop music songs in English with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (they are copies of the next word's beginning timestamp) and are not used in the evaluation. Sentence-level annotations are also provided. The audio comes in two versions: the original with instrumental accompaniment, and an a cappella version with singing voice only. An example song can be seen here.
You can read in detail about how the dataset was made here: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients. The dataset has been kindly provided by Jens Kofod Hansen.
- file duration up to 4:40 minutes (total time: 35:33 minutes)
- 3590 words annotated in total
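Since the ending timestamps in this dataset are copies of the next word's beginning timestamp, they can be derived from the beginnings alone. A minimal sketch, assuming the onsets are already loaded as a list of seconds; the function name `add_offsets` and the `audio_end` parameter (used for the last word, which has no successor) are hypothetical, and the times below are made up:

```python
from typing import List, Optional, Tuple


def add_offsets(onsets: List[float], words: List[str],
                audio_end: Optional[float] = None
                ) -> List[Tuple[float, Optional[float], str]]:
    """Pair each word with (onset, offset), where offset = next word's onset."""
    if len(onsets) != len(words):
        raise ValueError("onsets and words must have the same length")
    rows = []
    for i, (onset, word) in enumerate(zip(onsets, words)):
        # The last word has no following onset, so fall back to audio_end.
        offset = onsets[i + 1] if i + 1 < len(onsets) else audio_end
        rows.append((onset, offset, word))
    return rows


rows = add_offsets([0.5, 1.0, 1.75], ["no", "more", "carefree"], audio_end=3.0)
```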
Mauch's Dataset
The dataset contains 20 pop music songs in English with annotations of the beginning timestamp of each word. Non-vocal sections are not explicitly annotated (they remain included in the last preceding word). We leave it this way to enable comparison with previous work evaluated on this dataset. The audio has instrumental accompaniment. An example song can be seen here.
You can read in detail about how the dataset was used for the first time here: Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment. The dataset has been kindly provided by Sungkyun Chang.
- file duration up to 5:40 minutes (total time: 1:19:12 hours)
- 5050 words annotated in total
Gracenote Dataset
The dataset contains 15 pop music song excerpts in English with annotations of the beginning timestamp of each word. 8 song excerpts have instrumental accompaniment only. The other 7 song excerpts have two versions: with instrumental accompaniment and a cappella singing.
- file duration up to 1:11 minutes (total time: 11:42 minutes)
- 1181 words annotated in total
Phonetization
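A popular choice for phonetization of the words is the CMU pronunciation dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). One can phonetize them with the online tool (http://www.speech.cs.cmu.edu/tools/lextool.html). A list of all words of both datasets that are outside of the list of CMU words (https://github.com/georgid/AlignmentDuration/blob/noteOnsets/src/for_english/cmudict.0.6d.syll) is given here: https://www.dropbox.com/s/flu4cpqff916bas/words_not_in_dict?dl=0.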
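Finding words that fall outside the dictionary can be sketched as follows. This assumes the plain CMUdict text layout (one entry per line, the word followed by its phonemes, alternate pronunciations marked with a numeric suffix such as "(1)", comment lines starting with ";;;"); the syllabified variant linked above may differ. The function names and the few sample entries are illustrative only, not taken from the actual dictionary file.

```python
import re
from typing import Dict, Iterable, List


def load_cmudict(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each lowercased word to its list of pronunciations."""
    pron: Dict[str, List[List[str]]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        # Strip an alternate-pronunciation marker such as "(1)".
        word = re.sub(r"\(\d+\)$", "", head).lower()
        pron.setdefault(word, []).append(phones)
    return pron


def out_of_vocabulary(words: Iterable[str],
                      pron: Dict[str, List[List[str]]]) -> List[str]:
    """Return the sorted, de-duplicated words with no dictionary entry."""
    return sorted({w.lower() for w in words} - pron.keys())


# Illustrative entries in CMUdict-like layout.
sample = [
    ";;; sample entries",
    "NO  N OW1",
    "MORE  M AO1 R",
    "CAREFREE  K EH1 R F R IY2",
]
pron = load_cmudict(sample)
oov = out_of_vocabulary(["no", "more", "carefree", "oooh"], pron)
```

Words reported in `oov` would need a manual or rule-based phonetization before alignment.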
Audio Format
The data are wav/mp3 sound files, plus the associated word boundaries (in csv-like .txt/.tsv files).
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono) for a cappella and two channels for original
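A quick sanity check of these properties for WAV files could look like the sketch below, which uses Python's standard `wave` module (PCM WAV only; mp3 files would need a different reader). The helper names `is_cd_quality` and `write_silence` are hypothetical, and the demo runs on a short silent file written to a temporary directory rather than on dataset audio.

```python
import os
import tempfile
import wave


def is_cd_quality(path: str, expected_channels: int) -> bool:
    """Check for 16-bit samples, 44100 Hz, and the expected channel count."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == 44100
                and wf.getsampwidth() == 2  # 2 bytes per sample = 16 bits
                and wf.getnchannels() == expected_channels)


def write_silence(path: str, channels: int, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit, 44100 Hz WAV file for testing."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(44100)
        wf.writeframes(b"\x00" * (2 * channels * int(44100 * seconds)))


with tempfile.TemporaryDirectory() as tmp:
    mono = os.path.join(tmp, "a_cappella.wav")
    write_silence(mono, channels=1)
    mono_ok = is_cd_quality(mono, expected_channels=1)          # a cappella: mono
    stereo_mismatch = is_cd_quality(mono, expected_channels=2)  # original: stereo
```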