2006:Evalutron6000 Issues
Introduction
Stand by for an explication of this page and the Evalutron 6000 setup for the Audio Similarity and Retrieval and Symbolic Melodic Similarity tasks.
Description as drawn from the Institutional Review Board documents
Task Description: For the "similarity-based" class of MIR/MDL tasks, the submitted music retrieval systems are tested on whether (or not) they can find one or more songs "similar" to the standardized "query" songs selected by IMIRSEL. For each "query song," each retrieval system under evaluation gives IMIRSEL a list of songs that it "thinks" are similar to the "query song". These songs are called "candidate songs" (or simply, "candidates"). IMIRSEL gathers up these lists of candidates and then makes a "master list" of all the candidates for each query song. At this point, we turn to volunteers drawn from the MIR/MDL research community to act as similarity "graders." The graders' job is to determine whether each query/candidate pair is similar (or not). Once we have collected and aggregated the similarity judgments for each query/candidate pair, we can go back and determine, for each system, how many similar (and not similar) songs were retrieved for each query. Once the aggregated ground-truth data has been established, the evaluation procedures concerning the success (or failure) of the systems become automatic (i.e., no more human involvement) as the scoring of each system is done algorithmically by a series of computer programmes.
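Purely as an illustration of the pooling step described above, here is a minimal Python sketch of how per-system candidate lists could be merged into a single master list per query. This is not the actual IMIRSEL code, and the system names, query IDs, and song IDs are invented.

```python
# A minimal sketch (not the actual IMIRSEL code) of pooling per-system
# candidate lists into one "master list" per query. All names below are
# invented for illustration.
from collections import defaultdict

# Each submitted system returns, for every query, a list of candidate songs.
submissions = {
    "systemA": {"query01": ["song17", "song42", "song08"]},
    "systemB": {"query01": ["song42", "song99", "song23"]},
}

# Pool the candidates per query; duplicates collapse into a single entry.
master_list = defaultdict(set)
for system, results in submissions.items():
    for query, candidates in results.items():
        master_list[query].update(candidates)

for query, candidates in sorted(master_list.items()):
    print(query, sorted(candidates))
# query01 ['song08', 'song17', 'song23', 'song42', 'song99']
```

Because duplicate candidates collapse into one entry, a grader working from the pooled list has no way of telling which (or how many) systems proposed a given candidate.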
Location: IMIRSEL has developed a web-based similarity judgment collection system currently called the Evalutron 6000. The Evalutron 6000 is located on the servers of the Graduate School of Library and Information Science, UIUC. Screen shots of the system prototype are attached. The Evalutron 6000 is set up to:
1. Begin with a registration page that is designed to:
- 1.a. Ensure that each grader has read and agreed to the informed consent information before proceeding further
- 1.b. Establish a username/password pair which is needed for:
- 1.b.i. Data security and integrity
- 1.b.ii. Affording the graders the ability to "come and go" from the system as they wish because the system uses the username to record which query/candidate pairs have already been evaluated by each user. This also allows graders to modify earlier judgments if they want.
- 1.b.iii. Affording us the ability to delete specific grader scores should a grader decide to withdraw, should a security breach be detected, etc.
- 1.b.iv. Affording the algorithm underlying the system the ability to evenly distribute sub-sets of the query/candidate master lists across graders. This helps us minimize the burden on the graders (i.e., they need not evaluate every possible query/candidate pair).
- 1.c. Provide us with a contact email address to help us communicate with the graders over such issues as:
- 1.c.i. System problems
- 1.c.ii. Verification of membership in the MIR/MDL research community, etc.
- 1.c.iii. Confirmation of withdrawals from participation
2. Provide the graders with a set of web pages, one for each query song, that present:
- 2.a. The ability to hear, stop, start, rewind, etc. the query song
- 2.b. The sub-set listing (~15 to ~20) of the candidates for the query along with:
- 2.b.i. The ability to hear, stop, start, rewind, etc. each candidate song
- 2.b.ii. An input mechanism to indicate and record the grader's similarity judgment for each candidate
3. Use a robust, password-protected, PHP/MySQL database system to:
- 3.a. Collect and maintain the query/candidate lists
- 3.b. Generate the candidate sub-set lists seen by the graders so that each candidate is graded by roughly equal numbers of graders (see the sketch following this list)
- 3.c. Collect and preserve the raw grader scores, response times, etc.
- 3.d. Process the raw grader scores into aggregate similarity ground-truth values
- 3.e. Collect and maintain the confidential grader identification and consent information
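Points 1.b.iv and 3.b above describe spreading the pooled query/candidate pairs across graders so that no single grader has to judge every pair. The following is only a sketch, assuming a simple round-robin assignment; the actual Evalutron 6000 (a PHP/MySQL system) may distribute pairs differently, and all names here are illustrative.

```python
# A minimal sketch, assuming a simple round-robin scheme, of spreading pooled
# query/candidate pairs across graders so that every pair receives the same
# number of judgments. Not the actual Evalutron 6000 logic.
from collections import defaultdict
from itertools import cycle

def assign_pairs(pairs, graders, votes_per_pair):
    """Return {grader: [pairs]} with each pair judged votes_per_pair times."""
    assignments = defaultdict(list)
    grader_cycle = cycle(graders)
    for pair in pairs:
        for _ in range(votes_per_pair):
            assignments[next(grader_cycle)].append(pair)
    return assignments

pairs = [("query01", f"song{i:02d}") for i in range(1, 31)]   # 30 pooled pairs
graders = [f"grader{i}" for i in range(1, 6)]                 # 5 graders
plan = assign_pairs(pairs, graders, votes_per_pair=3)
print({g: len(p) for g, p in plan.items()})   # each grader receives 18 pairs
```

With round-robin assignment the load is balanced automatically: 30 pairs judged 3 times each is 90 judgments, or 18 per grader.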
Time Commitment: Evaluation of a single query against a candidate list takes approximately 8-10 minutes; completing 20 queries will therefore take a grader approximately 2 to 3 hours. Graders can stop at any time and resume grading, allowing them to complete the entire process in stages.
Date and Duration Issues: This protocol is designed to generate similarity ground-truth data for the evaluation of MIR/MDL systems under the conditions that pertain to the MIREX evaluation tasks. These tasks are ever-evolving as new test collections are generated and new algorithms are developed. Because of this constant evolution, every so often, at unpredictable intervals, IMIRSEL will need to collect new ground-truth data sets under this protocol. Thus, in general, this protocol has no fixed end date.
Measures: Currently a query/candidate song pair is deemed to be similar by simple majority vote of the graders (e.g., if two out of three graders say the pair is similar then we will deem that pair to be similar). Future evaluation runs might be set up to include the ability to gather "relative" similarity scores (i.e., on a scale from, say, 1 to 10, etc.) and/or "comparative" scores (i.e., song A is more/less similar to the query than song B, etc.). None of the aforementioned possible scoring variants materially affects the underlying purpose of the protocol under consideration which is, again, the establishment of similarity values between query/candidate pairs.
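As a concrete illustration of the simple-majority rule just described, the following sketch aggregates grader judgments into a ground-truth value per query/candidate pair; the pair IDs and votes are invented for illustration.

```python
# A small worked example of the simple-majority rule described above.
# The judgments below are invented purely for illustration.
from collections import Counter

# Raw grader judgments for two query/candidate pairs (True = "similar").
judgments = {
    ("query01", "song42"): [True, True, False],   # 2 of 3 say similar
    ("query01", "song08"): [False, False, True],  # 1 of 3 says similar
}

ground_truth = {}
for pair, votes in judgments.items():
    tally = Counter(votes)
    # Simple majority: the pair is deemed similar if more graders said yes than no.
    ground_truth[pair] = tally[True] > tally[False]

print(ground_truth)
# {('query01', 'song42'): True, ('query01', 'song08'): False}
```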
Derivative aggregate data (i.e., not specific to any individual grader) concerning the data collection process will be analyzed for possible system improvements and generalizations. Such aggregate data would include, for example, descriptive measures of number of pairs judged, ratios of similar to not similar judgments, estimates of judgment times, etc.
About Graders and Data Integrity: The volunteer graders will be drawn from the MIR/MDL researcher community and NOT the general public. We want to use only those graders who have a vested interest in making honest similarity assessments. MIR/MDL researchers have this vested interest in honesty as the creation of valid ground-truth data and the subsequent evaluation of their MIR/MDL systems based upon valid similarity ground-truth provide for them a scientifically valid basis for future research and development. To further ensure data integrity, the protocol is set up to be "double-blind" in the same manner that peer-reviewing is double-blind. In a very real sense, this protocol is peer-reviewing. Because IMIRSEL runs the submitted systems in-house against music collections unknown to the submitters, AND our protocol aggregates lists of query/candidate pairs from across all submissions, none of the graders knows from which system a particular query/candidate pair comes (i.e., it could even be from their own system and they would have no way of knowing!). The submitters of the systems under evaluation also have no way of knowing which graders provided similarity judgments for their systems as all the grading information is stripped of grader identifiers and then aggregated into a collective set of similarity values.
AudioSim Assumptions
As of 22 August 2006 here is a quick sketch of the parameter space that we are working under (subject to realities):
- # of votes/eyes per candidate: 5
- # of algorithms evaluated: 6
- # of candidates per query per algo: 5
- # of queries: 30
- # of comparisons: 4500
- # of "graders": 20
- # of comparisons per "grader": 225
- # of seconds assumed per comparison: 45
- # of total seconds to grade per grader: 10,125
- # of hours to grade per grader: 2.812
As you can see, change one value, and the rest re-jiggers itself (sometimes in very nasty ways).
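For reference, the figures above can be re-derived from one another as follows (assuming the candidate lists of the different algorithms do not overlap, so every candidate contributes a distinct query/candidate pair):

```python
# A quick re-derivation of the parameter-space figures listed above
# (assumes no overlap between the candidate lists of different algorithms).
votes_per_candidate = 5
algorithms = 6
candidates_per_query_per_algo = 5
queries = 30
graders = 20
seconds_per_comparison = 45

comparisons = votes_per_candidate * algorithms * candidates_per_query_per_algo * queries
per_grader = comparisons / graders
hours_per_grader = per_grader * seconds_per_comparison / 3600

print(comparisons, per_grader, round(hours_per_grader, 3))   # 4500 225.0 2.812
```

For example, bumping the number of algorithms from 6 to 7 pushes the total to 5,250 comparisons and each grader's load past 3.2 hours.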
Potential Audio Similarity "Graders"
- J. Stephen Downie, IMIRSEL
- Andreas Ehmann, IMIRSEL
- Kris West, IMIRSEL
- Xiao Hu, IMIRSEL
- Mert Bay, IMIRSEL
- M. Cameron Jones, IMIRSEL
- Paul Lamere, IMIRSEL
- Martin McCrory, IMIRSEL
- Qin Wei, IMIRSEL
- Beth Logan
- Mark Levy, QMUL