2026:Lyrics Transcription
Description
This page describes the MIREX 2026 Automatic Lyrics Transcription (ALT) challenge. The evaluation procedure and the submission format are described further down the page. The page is closely based on that of the 2025 ALT challenge.
The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:
Prediction(w) = argmax_w P(w|X)
where w denotes the word sequence and X the acoustic features.
Ideally, the lyrics transcriber should return meaningful word sequences:
Prediction(w) = [ <w_1>, <w_2>, ..., <w_N> ]
Note that for this year's edition, the input will always be a polyphonic mix (singing voice + musical accompaniment). The submitted algorithms can include a source-separation step if needed.
Notice: We particularly encourage submissions that build on the latest ALT approaches, such as those using pretrained audio foundation models or LLMs.
Evaluation
This year's edition will include the following metrics for evaluation:
Word Error Rate (WER): the standard metric used in Automatic Speech Recognition.
WER = (S + I + D) / (C + S + D)
where:
- C: correctly predicted words
- S: substitution errors
- I: insertion errors
- D: deletion errors
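As an illustration of the formula above, the following minimal sketch derives C, S, I and D from a minimum-edit-distance alignment of two word lists and then computes WER. It is not the official evaluation code from [1]: the function names are hypothetical, no text normalisation is applied, and a non-empty reference is assumed.

```python
def error_counts(reference, hypothesis):
    """Return (C, S, I, D) for a minimum-edit-distance alignment of two token lists."""
    # row[j] holds (errors, C, S, I, D) for the reference prefix seen so far
    # against hypothesis[:j]; the first row is "insert every hypothesis token".
    row = [(j, 0, 0, j, 0) for j in range(len(hypothesis) + 1)]
    for i, ref_tok in enumerate(reference, start=1):
        new_row = [(i, 0, 0, 0, i)]  # empty hypothesis: delete every reference token
        for j, hyp_tok in enumerate(hypothesis, start=1):
            e, c, s, ins, d = row[j - 1]
            if ref_tok == hyp_tok:
                diagonal = (e, c + 1, s, ins, d)      # correct word
            else:
                diagonal = (e + 1, c, s + 1, ins, d)  # substitution
            e, c, s, ins, d = row[j]
            deletion = (e + 1, c, s, ins, d + 1)
            e, c, s, ins, d = new_row[j - 1]
            insertion = (e + 1, c, s, ins + 1, d)
            # Tuples compare on the error count first, so min() keeps the cheapest path.
            new_row.append(min(diagonal, deletion, insertion))
        row = new_row
    return row[-1][1:]


def wer(reference_text, hypothesis_text):
    """Word Error Rate: (S + I + D) / (C + S + D)."""
    c, s, i, d = error_counts(reference_text.split(), hypothesis_text.split())
    return (s + i + d) / (c + s + d)


print(wer("the cat sat on the mat", "the cat sit on mat now"))
```

Since C + S + D always equals the number of reference words, the denominator does not depend on how ties between equally cheap alignments are broken.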
Character Error Rate (CER): the above computation can also be done at the character level. This metric penalises partially correct or misspelled words less heavily than WER does.
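The difference between the two levels can be seen on a small example. The sketch below applies the same edit-distance computation once to words and once to characters. It is only an illustration: the helper names are hypothetical, and counting spaces as characters is an assumption that may differ from the official implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or of characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the reference length (assumed non-empty)."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


reference = "hello darkness my old friend"
hypothesis = "hello darknes my old friend"

word_level = error_rate(reference.split(), hypothesis.split())  # 1 error over 5 words
char_level = error_rate(list(reference), list(hypothesis))      # 1 error over 28 characters
print(word_level, char_level)
```

A single misspelled word costs a full word error but only one character error, which is why CER is the more lenient of the two.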
Case-Sensitive WER (WER'): defined in [1], this metric is computed as follows:
WER' = (S + I + D + E_case) / N
where N is the total number of words in a ground-truth lyrics file, and E_case the number of "casing errors", i.e. words that differ only in a case-sensitive setting, like "city" and "City".
Punctuation, Parentheses, Line and Section Breaks: also defined in [1], we include a set of metrics to measure how well the proposed algorithms predict several formatting tokens. Each token is classified into one of five types: Word (W), Punctuation (P), Parentheses (B), Line Breaks (L), and Section Breaks (S). Then, for each token type (except Word), we compute Precision, Recall, and F1-Score.
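To make the token types concrete, here is a rough sketch that splits a lyrics string into typed tokens and computes precision, recall and F1 from raw counts. The regular expression, the function names and the way counts are passed in are all assumptions made for illustration; the tokenisation and alignment rules actually used are those of the public implementation from [1].

```python
import re

# Assumed tokenisation: blank line = section break, single newline = line break,
# round brackets = parentheses, word characters (with inner apostrophes) = word,
# any other non-space run = punctuation.
TOKEN_RE = re.compile(r"\n{2,}|\n|[()]|\w+(?:'\w+)*|[^\w\s()]+")


def classify(token):
    """Map a token to one of the five types W, P, B, L, S."""
    if token.startswith("\n"):
        return "S" if len(token) > 1 else "L"
    if token in ("(", ")"):
        return "B"
    if re.fullmatch(r"\w+(?:'\w+)*", token):
        return "W"
    return "P"


def typed_tokens(text):
    """Return a list of (token, type) pairs for a lyrics string."""
    return [(tok, classify(tok)) for tok in TOKEN_RE.findall(text)]


def precision_recall_f1(matched, predicted, reference):
    """Standard definitions from counts of matched, predicted and reference tokens."""
    precision = matched / predicted if predicted else 0.0
    recall = matched / reference if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


lyrics = "Hello, world (yeah)\nSecond line!\n\nNew section"
print("".join(kind for _, kind in typed_tokens(lyrics)))
```

For instance, a system that predicts 4 line breaks, 3 of which match the 6 line breaks of the reference, would obtain `precision_recall_f1(3, 4, 6)`.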
All these metrics will be based on the public code implementation from [1], simply extended to include CER.
Note that WER and CER will be retained as the main ranking criteria; the other metrics will be reported for more detailed comparisons.
In addition to these performance metrics, each submission will be evaluated in terms of memory use, number of operations, and computational time required to process a sample.
Submission Format
Several submission formats will be accepted to accommodate the different approaches chosen by the participants. If you encounter any issue with the submission process, please contact the Task Captain.
General Guidelines
All submissions should be "plug-and-play", with a clear README detailing usage steps.
The recommended submission format is a Docker image or a code repository with a main bash or Python script to run.
Resources Declaration: All submissions must state:
- The training data size (hours)
- The number of parameters in the model (if applicable)
- The amount of GPU/CPU hours used for training, with device information (model and VRAM)
- The inference time per hour of audio
I / O
The submitted algorithm must take as input argument a path to a folder with .wav files and output predicted lyrics to a destination folder, each result file named as the corresponding input, with the extension changed.
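A possible skeleton for such an entry point is sketched below. It is only an example under stated assumptions: `transcribe` is a hypothetical placeholder for the participant's model, the `.txt` output extension and the argument names `input_dir` and `output_dir` are assumptions, and any other layout that satisfies the requirement above is equally valid.

```python
import argparse
from pathlib import Path


def transcribe(wav_path: Path) -> str:
    """Hypothetical placeholder: replace with the actual model inference."""
    return ""


def run(input_dir: Path, output_dir: Path) -> list[Path]:
    """Transcribe every .wav file in input_dir and write one text file per song."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for wav_path in sorted(input_dir.glob("*.wav")):
        # Same file name as the input, with the extension changed (".txt" is assumed).
        out_path = output_dir / wav_path.with_suffix(".txt").name
        out_path.write_text(transcribe(wav_path), encoding="utf-8")
        written.append(out_path)
    return written


def main(argv=None):
    parser = argparse.ArgumentParser(description="Lyrics transcription entry point (sketch)")
    parser.add_argument("input_dir", type=Path, help="folder containing .wav files")
    parser.add_argument("output_dir", type=Path, help="folder receiving the lyrics files")
    args = parser.parse_args(argv)
    run(args.input_dir, args.output_dir)
```

Calling `main()` under the usual `if __name__ == "__main__":` guard turns this into a command-line script that takes the two folders as positional arguments.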
Input Audio
Participating algorithms must accept audio input in the following format:
- Musical tracks with accompaniment (polyphonic, not vocals only)
- 16-bit PCM wav files
- 44.1 kHz sampling rate
- stereo (2 channels)
Be aware that the wav files are obtained through conversion from compressed formats (mp3 or AAC) for the test sets we considered, so the quality is not exactly that of a perfect 44.1 kHz wav file.
This format is chosen as a standard input. If your pipeline expects different characteristics, the conversion process should be part of the submitted algorithm.
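As an example of such a conversion step, the sketch below downmixes a 16-bit stereo wav file to mono using only the Python standard library. It is a minimal illustration, with a hypothetical function name, and it deliberately leaves the sampling rate unchanged: resampling (for instance to 16 kHz) would typically rely on an external audio library and is not shown here.

```python
import array
import sys
import wave


def stereo_to_mono(src_path, dst_path):
    """Average the two channels of a 16-bit PCM stereo wav file into a mono file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo input")
        rate = src.getframerate()
        samples = array.array("h")  # signed 16-bit integers, interleaved L/R
        samples.frombytes(src.readframes(src.getnframes()))

    if sys.byteorder == "big":
        samples.byteswap()  # wav sample data is little-endian

    # Floor of the mean of left and right; the result always fits in 16 bits.
    mono = array.array(
        "h", ((left + right) // 2 for left, right in zip(samples[0::2], samples[1::2]))
    )

    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```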
Output File Format
A text file (per song) containing the lyrics, organized by lines and paragraphs.
<line1: word1> <line1: word_2> ... <line1: word_N>\n
<line2: word1> <line2: word_2> ... <line2: word_N>\n
\n
<line3: word1> <line3: word_2> ... <line3: word_N>\n
<line4: word1> <line4: word_2> ... <line4: word_N>\n
Words will be identified by splitting the text file at spaces, tabs, and newline characters. The submission can also simply generate a file of words separated by spaces, but it will then perform (very) poorly on the new metrics taken from [1].
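The layout and the word-splitting rule described above can be sketched as follows. The nested-list representation of a song (sections, then lines, then words) and the function names are assumptions made for this example only.

```python
def format_lyrics(sections):
    """Render lyrics given as a list of sections, each a list of lines, each a list of words.

    Words are separated by spaces, lines by a newline, and sections by an empty line.
    """
    paragraphs = ["\n".join(" ".join(words) for words in lines) for lines in sections]
    return "\n\n".join(paragraphs) + "\n"


def split_words(text):
    """Split on any run of whitespace, which covers spaces, tabs and newlines."""
    return text.split()


song = [
    [["la", "la", "la"], ["oh", "yeah"]],  # first section, two lines
    [["new", "verse"]],                    # second section, one line
]
text = format_lyrics(song)
print(text)
print(split_words(text))
```

Line and section breaks do not change the list of words returned by `split_words`, so they do not affect a word-level comparison, but they are what the formatting metrics look at.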
API-based submissions
For systems that rely on commercial LLM APIs (e.g. systems similar to [2]), participants must specify the exact model version and, where possible, provide a fallback local model to allow offline evaluation. The algorithm should either already contain an API key that allows the model to be used (we recommend setting an access token with a clear duration or computational limit), or use the fallback option described below.
Fallback: Pre-computed lyrics submission
In situations where the inference cannot be run directly by the Task Captain, participants may provide the lyrics files produced by their algorithms. This should only be considered a fallback option in case the algorithm relies on proprietary or commercial resources that cannot be made accessible to the Task Captain. Such submissions will be clearly flagged on the results page.
Training Datasets
In this challenge, the participants are encouraged but not obliged to use the open source dataset DALI described below.
The DAMP dataset is unfortunately no longer available.
Participants are free to use other datasets for training, as long as they are described in the submission and don't overlap with the evaluation sets.
DALI Dataset
DALI (a large Dataset of synchronised Audio, LyrIcs and notes) [3] is the benchmark dataset for building an acoustic model on polyphonic recordings [4], [5], [6] and it contains over 7000 songs with semi-automatically aligned lyrics annotations.
The songs are full-length commercial recordings, and the lyrics are annotated at several levels of granularity, including words and notes (and the syllables underlying a given note).
For each song, DALI provides a link to a matched YouTube video for audio retrieval.
Evaluation Datasets
The datasets listed below are reserved exclusively for evaluation purposes and must not be used for training models under any circumstances.
We also ask participants to refrain from checking the performance of their algorithms on these sets before submission, as doing so would effectively turn them into validation sets and leak evaluation data into model selection.
Jam-ALT dataset
The Jam-ALT dataset is built upon the Jamendo dataset [7] and contains 79 songs in 4 languages: English, French, German, and Spanish. All tracks include instrumental accompaniment. The lyrics were cleaned and their format harmonized in [1].
For English-only models, only the 20 English songs will be used for evaluation.
The mp3 tracks from the original dataset were converted to 16-bit PCM wav at 44.1 kHz sample rate for this MIREX challenge.
MUSDB-ALT
Introduced in [3], this dataset contains 39 English songs and will be used as an additional evaluation set.
We use the dataset in its compressed stems version, which means that there is no spectral energy above 16 kHz.
We extracted the full mixtures from the Native Instruments stems files and exported them as 16-bit PCM wav files with a 44.1 kHz sample rate.
Undisclosed Evaluation Set
In addition to Jam-ALT and MUSDB-ALT, the submissions will be evaluated on an in-house dataset of 10 English songs from the Western Popular Music repertoire, manually assembled and verified by the Task Captain.
Time and hardware limits
Due to the potentially high number of participants in this and other audio tasks, the runtime of submissions is limited: analysis must complete within a hard limit of 24 hours, using a single GPU with at most 32 GB of VRAM (V100 or similar). Submissions that require more resources can be submitted using the fallback method described above.
Questions?
- Contact Alexandre D'Hooge (Alex, he/him): dhooge[at]gbu[dot]edu[dot]cn
Bibliography
[1] Cífka, O., et al. (2024). Lyrics Transcription for Humans: A Readability-Aware Benchmark. Proc. of the 25th ISMIR Conf.
[2] Zhuo, L., et al. (2023). LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT. Proc. of the 24th ISMIR Conf.
[3] Syed, J., et al. (2025). Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper. ICMEW.
[4] Gupta, C., Yılmaz, E., & Li, H. (2020). Automatic lyrics alignment and transcription in polyphonic music: Does background music help? In ICASSP 2020, 496-500. IEEE.
[5] Basak, S., Agarwal, S., Ganapathy, S., & Takahashi, N. (2021, June). End-to-End Lyrics Recognition with Voice to Singing Style Transfer. In ICASSP 2021, 266-270. IEEE.
[6] Demirel, E., Ahlbäck, S., & Dixon, S. (2021). MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription. Proc. ISMIR 2021.
[7] Stoller, D., Durand, S., & Ewert, S. (2019). End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model. In ICASSP 2019, IEEE.