Difference between revisions of "2024:Audio Chord Estimation"

From MIREX Wiki
= Description =

This task requires participants to extract or transcribe a sequence of chords from an audio music recording. Extracting the harmonic structure of an audio track is valuable for many applications in music information retrieval, for example segmenting pieces into characteristic segments, finding similar pieces, or semantic analysis of music. Doing so requires estimating a sequence of chords that is as precise as possible. This includes the full characterisation of each chord (root, quality, and bass note) as well as the chronological order of the chords, including specific onset times and durations. Audio chord estimation has a long history in MIREX, and readers interested in this history, especially with respect to evaluation methodology, should review the work of Harte (2010), Pauwels and Peeters (2013), and the [https://www.music-ir.org/mirex/wiki/The_Utrecht_Agreement_on_Chord_Evaluation “Utrecht Agreement”] on evaluation metrics. For Python evaluation code, please refer to [https://craffel.github.io/mir_eval/#module-mir_eval.chord “mir_eval”].
= What's new =

Compared to previous years:

* Submission format: a Docker image is required. See the Submission Format section.
* An online leaderboard: there will be an online leaderboard showing evaluation results on a small validation set. Details will follow when the submission site opens.
* Potential new test datasets: we are discussing with dataset holders the possibility of using external proprietary datasets for the validation and test sets. Any such dataset will be private. Details about it (number of songs, genre distribution, etc.) will follow shortly.
= Data =

Two datasets are used to evaluate chord transcription accuracy.

; Isophonics
: The collected Beatles, Queen, and Zweieck datasets from the Centre for Digital Music at Queen Mary, University of London, as used for Audio Chord Estimation in MIREX for many years. Available from http://www.isophonics.net/. See also Matthias Mauch’s dissertation (2010) and Harte et al.’s introductory paper (2005).
; Billboard
: An abridged version of the ''Billboard'' dataset from McGill University, including a representative sample of American popular music from the 1950s through the 1990s. Available from http://billboard.music.mcgill.ca. See also Ashley Burgoyne’s dissertation (2012) and Burgoyne et al.’s introductory paper (2011). Parsing tools for the data are available from http://hackage.haskell.org/package/billboard-parser/ and documented by De Haas and Burgoyne (2012).
== Training and Testing ==

The training and testing divisions differ for the two data sets. The Isophonics data set has been publicly available for so long that it no longer makes sense to offer a separate training phase; as such, the entire data set will be used for testing, as in previous years. In contrast, in order to support MIREX, a portion of the ''Billboard'' ground truth has been withheld from the public. Submissions may train on all of the songs that have been publicly released so far: the MIREX servers have access to the ground-truth annotations and the original audio. Whether trained or not, all submissions will be tested against a fresh set of 200 songs that have never been released publicly.

The ground-truth files contain one line per chord segment, in the form <code>{start_time end_time chord}</code>, e.g.,
<pre>...
41.2631021 44.2456460 B:maj
44.2456460 45.7201230 E:maj
45.7201230 47.2061900 E:7/3
47.2061900 48.6922670 A:maj
48.6922670 50.1551240 A:min/b3
...</pre>
Start and end times are in seconds from the start of the file. Chord labels follow the syntax proposed by Harte et al. (2005). Please note that the syntax has changed slightly since it was originally described; in particular, the root is no longer implied as a voiced element of a chord, so a C major chord (notes C, E and G) should be written C:(1,3,5) instead of just C:(3,5) if using the interval list representation. As before, the labels C and C:maj are equivalent to C:(1,3,5).
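As an illustration of the file format above, the following is a minimal Python sketch of a reader for such lines. The names <code>ChordSegment</code> and <code>parse_chord_lines</code> are hypothetical and are not part of any MIREX tooling; the sketch assumes whitespace-separated fields and simply skips lines that do not have exactly three fields.

```python
from typing import List, NamedTuple


class ChordSegment(NamedTuple):
    """One annotated chord segment (hypothetical helper type)."""
    start: float  # seconds from the start of the file
    end: float    # seconds from the start of the file
    label: str    # chord label in the Harte et al. (2005) syntax


def parse_chord_lines(text: str) -> List[ChordSegment]:
    """Parse 'start_time end_time chord' lines into segments."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and elisions such as '...'
        start, end, label = float(fields[0]), float(fields[1]), fields[2]
        if end < start:
            raise ValueError(f"segment ends before it starts: {line!r}")
        segments.append(ChordSegment(start, end, label))
    return segments


example = """\
...
41.2631021 44.2456460 B:maj
44.2456460 45.7201230 E:maj
45.7201230 47.2061900 E:7/3
...
"""
segments = parse_chord_lines(example)
```

The sketch does not validate the chord label itself; checking labels against the chord syntax is left to a dedicated parser.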
= Evaluation =

To evaluate the quality of an automatic transcription, the transcription is compared to ground truth created by one or more human annotators. MIREX typically uses ''chord symbol recall'' (CSR) to estimate how well the predicted chords match the ground truth:

<math>\textrm{CSR} = \frac{\text{total duration of segments where annotation equals estimation}}{\text{total duration of annotated segments}}</math>

In previous years, MIREX has used an approximate CSR calculated by sampling both the ground-truth and the automatic annotations every 10 ms and dividing the number of correctly annotated samples by the total number of samples. Following Harte (2010, §8.1.2), however, we can view the ground-truth and estimated annotations as continuous segmentations of the audio and calculate the CSR from the cumulative length of the correctly overlapping segments. This way of calculating the CSR is more precise, as the precision of the frame-based method is limited by the frame length, and computationally more efficient, as it reduces the number of segment comparisons. Because pieces of music come in a wide variety of lengths, we will weight the CSR by the length of the song when computing an average for a given corpus. This final number is referred to as the ''weighted chord symbol recall'' (WCSR).
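The segment-based calculation can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MIREX evaluation code: the function names are hypothetical, labels are compared as plain strings (a real evaluation would first apply a chord vocabulary mapping), and the simple nested loop favours readability over speed.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start, end, label), times in seconds


def csr_durations(reference: List[Segment],
                  estimate: List[Segment]) -> Tuple[float, float]:
    """Return (correctly labelled duration, annotated duration) for one song."""
    annotated = sum(end - start for start, end, _ in reference)
    correct = 0.0
    for r_start, r_end, r_label in reference:
        for e_start, e_end, e_label in estimate:
            # Length of the time span shared by the two segments.
            overlap = min(r_end, e_end) - max(r_start, e_start)
            if overlap > 0 and r_label == e_label:
                correct += overlap
    return correct, annotated


def weighted_csr(per_song: List[Tuple[float, float]]) -> float:
    """Length-weighted average CSR: pooled correct time over pooled annotated time."""
    total_correct = sum(correct for correct, _ in per_song)
    total_annotated = sum(annotated for _, annotated in per_song)
    return total_correct / total_annotated


# Hypothetical toy data: a 4-second song and a 6-second song.
song_a = csr_durations([(0.0, 2.0, "C:maj"), (2.0, 4.0, "G:maj")],
                       [(0.0, 1.0, "C:maj"), (1.0, 4.0, "G:maj")])
song_b = csr_durations([(0.0, 6.0, "A:min")],
                       [(0.0, 6.0, "A:maj")])
csr_a = song_a[0] / song_a[1]
wcsr = weighted_csr([song_a, song_b])
```

In this toy case the first song has a CSR of 0.75 and the second 0.0; weighting by song length gives 3 correct seconds out of 10 annotated seconds, whereas an unweighted mean of the two CSR values would give 0.375.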
== Chord Vocabularies ==

We propose a set of single-chord evaluation measures for MIREX that extends those of previous MIREX iterations and combines them with evaluation measures proposed in the literature, providing a more complete assessment of transcription quality. Following Pauwels and Peeters (2013), we suggest using the CSR with five different chord vocabulary mappings.

In each of these calculations, the full chord descriptions of both the estimated and the ground-truth transcriptions, which might contain complex chord annotations, are mapped to one of the following classes:

# Chord root note only;
# Major and minor: {<code>N, maj, min</code>};
# Seventh chords: {<code>N, maj, min, maj7, min7, 7</code>};
# Major and minor with inversions: {<code>N, maj, min, maj/3, min/b3, maj/5, min/5</code>}; or
# Seventh chords with inversions: {<code>N, maj, min, maj7, min7, 7, maj/3, min/b3, maj7/3, min7/b3, 7/3, maj/5, min/5, maj7/5, min7/5, 7/5, maj7/7, min7/b7, 7/b7</code>}.

With the exception of no-chords, calculating the vocabulary mapping involves examining the root note, the bass note, and the relative interval structure of the chord labels. A mapping exists if both the root notes and bass notes match, and the structure of the output label is the largest possible subset of the input label given the vocabulary. For instance, in the major and minor case, <code>G:7(#9)</code> is mapped to <code>G:maj</code> because the interval set of <code>G:maj</code>, {<code>1,3,5</code>}, is a subset of the interval set of <code>G:7(#9)</code>, {<code>1,3,5,b7,#9</code>}. In the seventh-chord case, <code>G:7(#9)</code> is mapped to <code>G:7</code> instead because the interval set of <code>G:7</code>, {<code>1,3,5,b7</code>}, is also a subset of that of <code>G:7(#9)</code> but is larger than that of <code>G:maj</code>. If a chord cannot be represented by a certain class, e.g., mapping a <code>D:aug</code> or <code>F:sus4(9)</code> to {<code>maj, min</code>}, the chord is excluded from the evaluation if it occurs in the ground truth, and it is considered a mismatch if it occurs in an estimated annotation.
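The "largest possible subset" rule can be sketched in Python as below. This is a simplified illustration, not the official mapping: it looks only at the interval structure and leaves out the root, bass note, and no-chord handling; the names <code>MAJMIN</code>, <code>SEVENTHS</code>, and <code>map_quality</code> are hypothetical; and interval sets are written out by hand rather than parsed from chord labels.

```python
from typing import Dict, FrozenSet, Optional

# Interval sets of the vocabulary qualities (inversions omitted in this sketch).
MAJMIN: Dict[str, FrozenSet[str]] = {
    "maj": frozenset({"1", "3", "5"}),
    "min": frozenset({"1", "b3", "5"}),
}
SEVENTHS: Dict[str, FrozenSet[str]] = {
    **MAJMIN,
    "maj7": frozenset({"1", "3", "5", "7"}),
    "min7": frozenset({"1", "b3", "5", "b7"}),
    "7": frozenset({"1", "3", "5", "b7"}),
}


def map_quality(intervals: FrozenSet[str],
                vocabulary: Dict[str, FrozenSet[str]]) -> Optional[str]:
    """Return the vocabulary quality whose interval set is the largest
    subset of `intervals`, or None if no quality fits."""
    candidates = [(len(quality_intervals), name)
                  for name, quality_intervals in vocabulary.items()
                  if quality_intervals <= intervals]
    if not candidates:
        return None  # chord cannot be represented in this vocabulary
    return max(candidates)[1]


seven_sharp_nine = frozenset({"1", "3", "5", "b7", "#9"})  # e.g. G:7(#9)
augmented = frozenset({"1", "3", "#5"})                    # e.g. D:aug

in_majmin = map_quality(seven_sharp_nine, MAJMIN)
in_sevenths = map_quality(seven_sharp_nine, SEVENTHS)
aug_in_majmin = map_quality(augmented, MAJMIN)
```

Here <code>in_majmin</code> is <code>"maj"</code>, <code>in_sevenths</code> is <code>"7"</code>, and <code>aug_in_majmin</code> is <code>None</code>, matching the examples in the text. A <code>None</code> result would then be handled as described above: excluded if it comes from the ground truth, counted as a mismatch if it comes from an estimate.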
{|
|+ Most frequent chord qualities in the McGill ''Billboard'' corpus.
! Quality !! Freq. (%) !! Cum. Freq. (%)
|-
| maj || 52 || 52
|-
| min || 13 || 65
|-
| 7 || 10 || 75
|-
| min7 || 8 || 83
|-
| maj7 || 3 || 86
|-
| 5 || 2 || 88
|-
| 1 || 2 || 90
|-
| maj(9) || 1 || 91
|-
| maj6 || 1 || 92
|-
| sus4 || 1 || 93
|-
| sus7 || 1 || 94
|-
| sus9 || 1 || 94
|-
| 7(#9) || 1 || 95
|-
| min9 || 1 || 96
|}

Our recommendations are motivated by the frequencies of chord qualities in the ''Billboard'' corpus (see table above), which is a balanced sample of American popular music from the 1950s through the 1990s (Burgoyne, Wild, and Fujinaga 2011). Pure major and minor chords alone account for 65 percent of all chords encountered, whereas augmented and diminished triads each account for 0.2 percent or less of the corpus. Our argument for our particular seventh-chord vocabulary, as opposed to the set of all tetrads, follows similar reasoning: our proposed vocabulary accounts for 86 percent of all chords, whereas no other standard type of seventh chord accounts for more than 0.2 percent of the corpus. The table suggests that, in future years, we might consider introducing vocabularies that include power chords, and possibly suspended chords or added sixths and ninths as well.
== Chord Segmentation ==

Besides CSR, the chord transcription literature includes several other metrics for evaluating chord transcriptions, which mainly focus on the segmentation of the automatic transcription. We propose to include the directional Hamming distance in the evaluation. The directional Hamming distance is calculated by finding, for each annotated segment, the maximally overlapping segment in the other annotation, and then summing the differences (Abdallah et al. 2005; Mauch 2010, §2.3.3). Depending on the order of application, the directional Hamming distance yields a measure of over- or under-segmentation. Both directions can be combined to yield an overall quality metric (Harte 2010, §8.3.2):

<math>Q = 1 - \frac{\text{maximum of directional Hamming distances in either direction}}{\text{total duration of song}}</math>
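A minimal Python sketch of this measure follows. It reads "summing the differences" as each segment's length minus its largest overlap with any segment of the other annotation, which is an interpretation of the description above rather than a quotation of the cited definitions. The function names are hypothetical, and the sketch assumes that both annotations cover the same time span, so that the song duration can be taken from the reference.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def directional_hamming(source: List[Interval], target: List[Interval]) -> float:
    """Sum, over source segments, of the part not covered by the
    maximally overlapping target segment."""
    distance = 0.0
    for s_start, s_end in source:
        best_overlap = max(
            (min(s_end, t_end) - max(s_start, t_start) for t_start, t_end in target),
            default=0.0,
        )
        distance += (s_end - s_start) - max(best_overlap, 0.0)
    return distance


def segmentation_quality(reference: List[Interval], estimate: List[Interval]) -> float:
    """Q = 1 - max(directional distances) / total duration of the song."""
    duration = reference[-1][1] - reference[0][0]
    worst = max(directional_hamming(reference, estimate),
                directional_hamming(estimate, reference))
    return 1.0 - worst / duration


# Hypothetical toy data: the estimate splits the first reference segment in two.
reference = [(0.0, 4.0), (4.0, 8.0)]
estimate = [(0.0, 2.0), (2.0, 4.0), (4.0, 8.0)]

ref_to_est = directional_hamming(reference, estimate)
est_to_ref = directional_hamming(estimate, reference)
quality = segmentation_quality(reference, estimate)
```

In this toy case the distance from reference to estimate is 2 seconds, the distance in the other direction is 0, and Q is 1 - 2/8 = 0.75.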
= Submission Format =

== Audio Format ==

Audio tracks in the training directory will be encoded as 44.1 kHz 16-bit mono WAV files.

== I/O Format ==

The algorithms should output text files in a format similar to that used in the ground-truth transcriptions. That is to say, they should be flat text files with chord segment labels and times arranged thus:

<pre>start_time end_time chord_label</pre>
with elements separated by whitespace, times given in seconds, chord labels corresponding to the syntax described by Harte et al. (2005), and one chord segment per line. As in all benchmarks after 2008, end times are a mandatory component of the output. For the evaluation process we will assume enharmonic equivalence for chord roots. We will no longer accept participants who would only like to be evaluated on major/minor chords and want to use the number format.
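For illustration, an output file in this format could be produced with a small Python helper such as the one below. The function name <code>format_chord_segments</code> is hypothetical, and the seven decimal places are an assumption chosen to match the precision of the ground-truth example; the task description does not prescribe a number of decimals.

```python
from typing import Iterable, Tuple


def format_chord_segments(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, label) tuples as 'start_time end_time chord_label' lines."""
    lines = []
    for start, end, label in segments:
        if end <= start:
            raise ValueError(f"end time must follow start time: {start} {end} {label}")
        # Seven decimals is an assumption, not a requirement of the task.
        lines.append(f"{start:.7f} {end:.7f} {label}")
    return "\n".join(lines) + "\n"


output_text = format_chord_segments([
    (0.0, 1.5, "N"),
    (1.5, 3.25, "C:maj"),
    (3.25, 5.0, "A:min/b3"),
])
```

The resulting string can then be written to the output path that is passed to the submission on the command line.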
== Command line calling format ==

Submissions using machine learning models must also submit their trained models. Starting this year, training on the evaluation server is no longer supported. We will execute the following commands for testing:

<pre>prepare.sh
doChordID.sh "/path/to/input1.wav" "/path/to/output1.wav.txt"
doChordID.sh "/path/to/input2.wav" "/path/to/output2.wav.txt"
...</pre>

In the results directory, there should be one file for each test file, named after the test file with <code>.txt</code> appended. Programs can use their working directory if they need to keep temporary cache files or internal debugging information. Standard output and standard error will be logged.

No internet access is allowed during the inference stage (<code>doChordID.sh</code>). Please contact us if your model requires internet access (e.g., a model API call) during inference.
== Packaging submissions ==

* Every submission must be packed into a Docker image
* Every submission will be deployed and evaluated automatically with <code>docker run</code>

Accepted submission forms:
* Link to a public or private GitHub repository
* Link to a public or private Docker Hub repository
* Shared Google Drive links
* If the repository is private, an access token is also required
= Time and Hardware limits =

A Linux server with one NVIDIA GeForce RTX 3090 is used for evaluation. CPU, OS, and memory specifications will be announced later.

Time limit: the total run time must be within 5 times the total duration of the test set.
= Bibliography =

Abdallah, Samer A., Katy Noland, Mark B. Sandler, Michael Casey, and Christophe Rhodes. 2005. “Theory and Evaluation of a Bayesian Music Structure Extractor.” In ''Proceedings of the International Society for Music Information Retrieval Conference'', 420–425.

Burgoyne, J. A., J. Wild, and I. Fujinaga. 2011. “An expert ground truth set for audio chord recognition and music analysis.” In ''Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR)'', 633–638.

Burgoyne, John Ashley. 2012. “Stochastic Processes and Database-Driven Musicology.” Ph.D. diss. Montréal, Québec, Canada: McGill University.

Haas, W. B. de, and John Ashley Burgoyne. 2012. ''Parsing the Billboard Chord Transcriptions''. Technical report UU-CS-2012-018, Department of Information and Computing Sciences, Utrecht University.

Harte, C., M. Sandler, S. Abdallah, and E. Gómez. 2005. “Symbolic representation of musical chords: A proposed syntax for text annotations.” In ''Proceedings of the 6th International Society for Music Information Retrieval Conference (ISMIR)'', 66–71.

Harte, Christopher. 2010. “Towards automatic extraction of harmony information from music signals.” Ph.D. diss. Queen Mary, University of London.

Mauch, Matthias. 2010. “Automatic Chord Transcription from Audio Using Computational Models of Musical Context.” Ph.D. diss. Queen Mary, University of London.

Pauwels, Johan, and Geoffroy Peeters. 2013. “Evaluating automatically estimated chord sequences.” In ''Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)''. Vancouver, British Columbia, Canada.

Revision as of 01:33, 25 August 2024
