2026:Audio Instrument Recognition

From MIREX Wiki

Description

This page describes the MIREX 2026: Audio Instrument Recognition task.

The task is clip-level multi-label instrument recognition. Given a music audio excerpt, a submitted system should predict which instruments are present in the excerpt.

For an input audio excerpt X, the system outputs confidence scores over a fixed instrument label set:

Prediction(X) = [s_1, s_2, ..., s_K]

where K is the number of labels in the official instrument vocabulary, and s_k is the predicted confidence score for instrument k.

Instrument Label Set

The task will use a fixed instrument label set for evaluation. The label vocabulary is based on the OpenMIC-2018 instrument taxonomy, which contains the following 20 labels:

  • accordion
  • banjo
  • bass
  • cello
  • clarinet
  • cymbals
  • drums
  • flute
  • guitar
  • mallet_percussion
  • mandolin
  • organ
  • piano
  • saxophone
  • synthesizer
  • trombone
  • trumpet
  • ukulele
  • violin
  • voice

The final evaluated labels will be selected from this vocabulary according to coverage and annotation reliability in the official evaluation data. Labels with insufficient positive examples may be excluded from the official ranking.

Dataset-specific labels may be mapped to the official vocabulary when necessary. For example, "drum kit" and "drum set" may be mapped to "drums", and "synth" may be mapped to "synthesizer".
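A label mapping of this kind can be sketched as a simple lookup table. This is an illustration only, not the organizers' mapping: the names ALIASES and map_label are hypothetical, and the table contains just the aliases mentioned in the text plus the official names themselves.

```python
# Hypothetical alias table: only the aliases named in the task description,
# plus the official names mapped to themselves.
ALIASES = {
    "drum kit": "drums",
    "drum set": "drums",
    "drums": "drums",
    "synth": "synthesizer",
    "synthesizer": "synthesizer",
}


def map_label(raw_label):
    """Return the official label for a dataset-specific label, or None if unmapped."""
    key = raw_label.strip().lower()
    return ALIASES.get(key)


print(map_label("Drum Kit"))
print(map_label("theremin"))
```

Returning None for unmapped labels leaves the decision to exclude or hand-review them to the caller rather than silently guessing.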

The official evaluation label set is expected to follow the 20-label OpenMIC-based vocabulary listed above. Any necessary dataset-specific label mapping or label exclusion after the final annotation audit will be documented and applied uniformly to all submitted systems.

Datasets

Training Datasets

There are no restrictions on the training data used by participating systems. However, each submission must clearly state the training data used in its system description or extended abstract.

Participants should report:

  • the names of all training datasets used;
  • whether OpenMIC-2018 was used for training, validation, threshold tuning, or model selection;
  • any external pretrained models used;
  • any additional data augmentation or post-processing steps.

Evaluation Datasets

The evaluation will include an official hidden evaluation set curated for this task. The hidden evaluation set will consist of music audio excerpts with clip-level instrument-presence annotations.

The hidden evaluation data will not be distributed to participants. Submitted systems will be run by the task organizers or through the MIREX evaluation infrastructure.

A public reference evaluation may also be reported using the official OpenMIC-2018 test partition. Results on this public reference set will be reported separately from the official hidden evaluation results.

Submission Format

Submissions should be packaged as a compressed file, such as .zip, .tar.gz, or .rar.

Each submission should contain at least the following files:

A) The main recognition script

The main recognition script should be executable from the command line. It may be a bash script, Python script, binary executable, or another clearly documented executable entry point.

The submitted system must take as input a directory of audio files and produce an output file containing predicted instrument scores for each audio excerpt.

Denoting the input audio directory as ${input_dir} and the output file path as ${output}, a program called foobar may be invoked as:

foobar ${input_dir} ${output}

or with flags:

foobar -i ${input_dir} -o ${output}

If the submission requires additional arguments, such as a model checkpoint path or configuration file, these should be clearly documented in the README file. For example:

python run_instrument_recognition.py -i ${input_dir} -o ${output} --checkpoint model.pt
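For a Python entry point, the command line above could be parsed roughly as follows. This is a minimal sketch under assumptions: the flags -i, -o, and --checkpoint come from the example command, while the long option names --input-dir and --output and the helper parse_args are hypothetical. The recognition step itself is omitted.

```python
import argparse


def parse_args(argv=None):
    """Parse the evaluation command line (option names partly assumed)."""
    parser = argparse.ArgumentParser(
        description="Instrument recognition entry point (illustrative)."
    )
    parser.add_argument("-i", "--input-dir", required=True,
                        help="directory containing the input audio files")
    parser.add_argument("-o", "--output", required=True,
                        help="path of the prediction file to write")
    parser.add_argument("--checkpoint", default=None,
                        help="optional model checkpoint path")
    return parser.parse_args(argv)


# Demo with an explicit argument list instead of sys.argv.
args = parse_args(["-i", "audio/", "-o", "predictions.tsv", "--checkpoint", "model.pt"])
print(args.input_dir, args.output, args.checkpoint)
```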

B) The README file

Each submission must include a README file containing:

  • contact information;
  • installation instructions;
  • software and hardware requirements;
  • instructions for running the submitted system;
  • the exact command line to be used for evaluation;
  • information about required model checkpoints or external files.

The README should include at least one command line containing both ${input_dir} and ${output} so that the evaluation can be run automatically.

C) System description or extended abstract

Participants should submit a short system description or extended abstract. This document should summarize the model architecture, training data, external pretrained models if used, and important preprocessing or post-processing steps.

Input Data

Participating systems will receive a directory containing audio files.

The expected input audio format is:

  • Audio format: WAV
  • Sample rate: 44.1 kHz, unless otherwise specified
  • Bit depth: 16-bit PCM
  • Number of channels: mono or stereo

The final input format will be confirmed before evaluation.
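Participants who want to check that their own test files match the expected format can inspect the WAV header with Python's standard wave module. The sketch below assumes uncompressed PCM WAV input. The helper names describe_wav and matches_expected_format are hypothetical, and the demo file is generated on the fly so that the example is self-contained.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bit_depth, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_expected_format(path):
    """Check 44.1 kHz, 16-bit, mono or stereo, as listed above."""
    rate, bits, channels = describe_wav(path)
    return rate == 44100 and bits == 16 and channels in (1, 2)


# Demo: write one second of 16-bit mono silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 44100)

print(describe_wav(demo_path))
print(matches_expected_format(demo_path))
```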

Output Data

The submitted system must produce one output file containing predictions for all input audio files.

The preferred output format is a tab-separated text file. Each line should contain an audio filename, an instrument label, and a confidence score:

<filename>\t<label>\t<score>

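where <score> is a real-valued confidence score, preferably in the range [0, 1]. Because scores are binarized at a fixed threshold of 0.5 for the F1 metrics (see Evaluation), scores on a different scale should be rescaled accordingly.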

Example:

track_001.wav	piano	0.93
track_001.wav	violin	0.81
track_001.wav	drums	0.76
track_001.wav	guitar	0.12
track_002.wav	flute	0.88
track_002.wav	piano	0.64
track_002.wav	cello	0.21

The instrument labels in the output must match the official label set exactly. If an audio-file/label pair is missing from the output, it may be treated as having score 0.
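Since labels must match exactly, it may help to validate an output file before submitting. The following sketch parses one line of the tab-separated format and rejects labels outside the 20-label vocabulary listed above. The function name parse_line is hypothetical, and the official evaluation code may apply different checks.

```python
OFFICIAL_LABELS = {
    "accordion", "banjo", "bass", "cello", "clarinet", "cymbals", "drums",
    "flute", "guitar", "mallet_percussion", "mandolin", "organ", "piano",
    "saxophone", "synthesizer", "trombone", "trumpet", "ukulele", "violin",
    "voice",
}


def parse_line(line):
    """Split one output line into (filename, label, score) or raise ValueError."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    filename, label, score_text = fields
    if label not in OFFICIAL_LABELS:
        raise ValueError(f"label not in official set: {label!r}")
    return filename, label, float(score_text)


print(parse_line("track_001.wav\tpiano\t0.93\n"))
```

The comparison is case-sensitive on purpose, so a label such as "Piano" is reported as an error instead of being silently accepted.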

Systems are encouraged to output confidence scores for all official labels for each input audio file. If a system only produces binary predictions, it may output 0/1 values instead of continuous confidence scores.
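One way to produce a complete output file, with a score for every official label of every input file, is sketched below. The function write_predictions and the four-decimal formatting are assumptions made for illustration. Labels without a model score are written as 0.0, which mirrors how missing pairs may be treated.

```python
import csv
import io

LABELS = [
    "accordion", "banjo", "bass", "cello", "clarinet", "cymbals", "drums",
    "flute", "guitar", "mallet_percussion", "mandolin", "organ", "piano",
    "saxophone", "synthesizer", "trombone", "trumpet", "ukulele", "violin",
    "voice",
]


def write_predictions(scores_by_file, out):
    """Write one tab-separated filename/label/score line per file/label pair."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for filename in sorted(scores_by_file):
        scores = scores_by_file[filename]
        for label in LABELS:
            # Labels the model did not score are written explicitly as 0.0.
            writer.writerow([filename, label, f"{scores.get(label, 0.0):.4f}"])


# Demo: write to an in-memory buffer; a real system would open ${output}.
buffer = io.StringIO()
write_predictions({"track_001.wav": {"piano": 0.93, "violin": 0.81}}, buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))
print(lines[12])
```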

Evaluation

The official evaluation is performed at the clip level and treats the task as multi-label instrument recognition.

For each input excerpt, the system predicts a confidence score for each instrument in the official label set. These predictions will be compared with the ground-truth instrument labels for that excerpt.

The primary ranking metric will be:

Macro-averaged F1 score

Macro-F1 computes the F1 score separately for each instrument class and then takes the unweighted mean across classes, so rare instruments count as much as common ones.
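The macro-F1 computation can be illustrated with a few lines of plain Python. This is a sketch, not the official scoring code: the function names are hypothetical, and setting F1 to 0 for a class with no true or predicted positives is an assumption, since the page does not specify how undefined cases are handled.

```python
def f1_score(true_flags, pred_flags):
    """F1 for one class from parallel lists of 0/1 flags (one per clip)."""
    pairs = list(zip(true_flags, pred_flags))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = 2 * tp + fp + fn
    # Assumption: F1 is taken as 0 when it is undefined (no positives at all).
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; inputs map label -> list of 0/1 flags."""
    per_class = [f1_score(y_true[label], y_pred[label]) for label in y_true]
    return sum(per_class) / len(per_class)


# Toy example with two classes and four clips.
y_true = {"piano": [1, 1, 0, 0], "flute": [0, 1, 0, 0]}
y_pred = {"piano": [1, 0, 0, 0], "flute": [0, 1, 1, 0]}
print(round(macro_f1(y_true, y_pred), 4))
```

In the toy example both classes have F1 = 2/3 (piano has one miss, flute has one false alarm), so the macro average is also 2/3.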

The following additional metrics may also be reported:

  • Micro-F1
  • Mean Average Precision (mAP), when confidence scores are available
  • Per-instrument precision, recall, and F1
  • Per-instrument average precision, when confidence scores are available
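For reference, micro-F1 and mAP can be sketched in plain Python as follows. The function names are hypothetical. Average precision is computed here as the mean of the precision values at the rank of each positive clip, with ties kept in input order. The official implementation may differ in tie handling and in how classes without positives are treated.

```python
def micro_f1(y_true, y_pred):
    """Pool TP/FP/FN over all labels and clips before computing F1."""
    tp = fp = fn = 0
    for label in y_true:
        for t, p in zip(y_true[label], y_pred[label]):
            tp += 1 if t and p else 0
            fp += 1 if p and not t else 0
            fn += 1 if t and not p else 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(true_flags, scores):
    """Mean precision at the rank of each positive, ranking by descending score."""
    ranked = sorted(zip(scores, true_flags), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: AP is taken as 0 for a class with no positive clips.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(y_true, y_score):
    """Unweighted mean of per-class average precision."""
    aps = [average_precision(y_true[label], y_score[label]) for label in y_true]
    return sum(aps) / len(aps)


y_true = {"piano": [1, 0, 1, 0], "flute": [0, 1, 0, 0]}
y_score = {"piano": [0.9, 0.8, 0.7, 0.1], "flute": [0.2, 0.6, 0.7, 0.1]}
print(round(mean_average_precision(y_true, y_score), 4))
```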

The official leaderboard will be determined by clip-level macro-F1 on the hidden evaluation set. Results on any public reference dataset will be reported separately.

For confidence-based outputs, scores will be converted to binary predictions using a fixed threshold of 0.5. Scores greater than or equal to 0.5 will be treated as positive predictions; scores below 0.5 will be treated as negative predictions.
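The thresholding rule can be expressed as follows. This is a sketch with hypothetical names (binarize, THRESHOLD). It also fills in a score of 0 for file/label pairs absent from the output, which the page says may happen but does not guarantee.

```python
THRESHOLD = 0.5


def binarize(scores, filenames, labels, threshold=THRESHOLD):
    """Map every (filename, label) pair to 0 or 1.

    scores: dict (filename, label) -> confidence score.
    Pairs missing from scores are treated as score 0 (assumed behaviour).
    """
    return {
        (filename, label): int(scores.get((filename, label), 0.0) >= threshold)
        for filename in filenames
        for label in labels
    }


scores = {
    ("track_001.wav", "piano"): 0.93,
    ("track_001.wav", "guitar"): 0.12,
    ("track_001.wav", "drums"): 0.5,  # exactly at the threshold: positive
}
decisions = binarize(scores, ["track_001.wav"], ["piano", "guitar", "drums", "flute"])
print(decisions)
```

Note that a score exactly equal to 0.5 counts as positive, because the comparison is greater than or equal.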

Minor updates to the evaluation protocol may be made after the final data audit. Any changes will be announced on this page before the official results are released.

Time and Hardware Limits

Due to the potentially high number of participants in MIREX audio tasks, runtime and hardware limits may be imposed.

Submissions should be able to run within the limits specified by the task organizers. Submissions that exceed the time limit, require unsupported hardware, or cannot be run according to the provided README may not receive an official result.

Participants should clearly state any special hardware requirements, such as GPU requirements, in the README file.

Questions?

For questions about this task, please contact:

Bibliography

1. E. J. Humphrey, S. Durand, and B. McFee, “OpenMIC-2018: An open dataset for multiple instrument recognition,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 2018, pp. 438–444.

2. J. J. Bosch, J. Janer, F. Fuhrmann, and P. Herrera, “A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals,” in Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, 2012, pp. 559–564.

3. R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” in Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 2014, pp. 155–160.

4. D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The MTG-Jamendo dataset for automatic music tagging,” in Machine Learning for Music Discovery Workshop, International Conference on Machine Learning, Long Beach, CA, USA, 2019.