2007:Evalutron6000 Walkthrough For Symbolic Melodic Similarity
Special Comments about Grading the Symbolic Melodic Similarity Submissions
Special Update on Candidate Files (3 September 2006)
Nota Bene (The following description of the Mixed and Karaoke candidates has been updated, 3 September 2006). The length of candidates for the polyphonic task, drawn from the "Mixed" and "Karaoke" collections (Queries #7 through #17), is now more consistently in the 30 second range. We discovered that the timing information returned was not a reliable indicator of where the "best match" was found in each of the candidates. To correct for this, we took the "best match" starting information and then "padded" the candidate on both sides of the "best match" start point. Thus, the first 10 or so seconds of each candidate might, or might not, represent a strong "hit". The last 10 or so seconds might also start "wandering away" from the "best match" region. Therefore, we recommend that you listen to the entire candidate before giving your grade and be mindful that the "best match" section in each candidate is a bit slippery.
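The padding scheme described above can be sketched roughly as follows. This is an illustrative reconstruction only, not the actual MIREX tooling; the function name, the 10-second pad, and the 30-second window are assumptions based on the description.

```python
def padded_window(best_match_start, clip_length, pad=10.0, window=30.0):
    """Return (start, end) in seconds for an excerpt padded around the
    reported best-match start point.

    best_match_start: reported (possibly unreliable) match time, in seconds
    clip_length: total length of the candidate file, in seconds
    pad: seconds of context before the best-match start (assumed ~10 s)
    window: total excerpt length in seconds (assumed ~30 s)
    """
    start = max(0.0, best_match_start - pad)
    end = min(clip_length, start + window)
    # If the window runs past the end of the clip, slide it back so the
    # excerpt stays roughly 30 seconds long.
    start = max(0.0, end - window)
    return start, end
```

Because the excerpt extends well beyond the reported match point on both sides, the strongest "hit" may fall anywhere inside it, which is why the whole candidate should be heard before grading.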
Task Description: Focusing on Melody
The goal of the Symbolic Melodic Similarity Task is to evaluate how well various algorithms can retrieve results that are MELODICALLY similar to a given query. You will find in the candidate files a variety of different instrumentations, as set by the creators of the MIDI files. We need you to look beyond the differences in timbre and instrumentation when assigning your grading scores.
Grading Expectations and "Reasonableness"
For each candidate, we need you to assign BOTH a Broad Category score AND a Fine Score (i.e., a numeric grade between 0 and 10). You have the freedom to make whatever associations you desire between a particular Broad Category score and its related Fine Score. In fact, we expect to see variations across evaluators in the relationships between Broad Categories and Fine Scores, as this is a normal part of human subjectivity. However, we will be using the two different types of scores to do important interrelated post-Evalutron calculations, so please do be thoughtful in selecting your Broad Categories and related Fine Scores. What we are really asking here is that you apply a level of "reasonableness" to both your scores and your associations. For example, if you score a candidate in the VERY SIMILAR category, a Fine Score of 2.1 would not be, by most standards, "reasonable". The same applies at the other extreme: a Broad Category score of NOT SIMILAR should not be associated with a Fine Score of, say, 7.2 or 8.4.
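One way to picture the "reasonableness" being asked for is as overlapping Fine Score ranges per Broad Category. The ranges below are purely illustrative assumptions (the page deliberately leaves the exact associations to each grader), and the middle category name is a guess, but the sketch captures why (VERY SIMILAR, 2.1) or (NOT SIMILAR, 7.2) would be flagged:

```python
# Illustrative ranges only: the actual category/score associations are
# each grader's own judgment, not defined by the Evalutron 6000.
REASONABLE_RANGES = {
    "NOT SIMILAR": (0.0, 4.0),
    "SOMEWHAT SIMILAR": (2.0, 8.0),  # assumed middle-category name
    "VERY SIMILAR": (6.0, 10.0),
}

def is_reasonable(broad_category, fine_score):
    """Return True when the Fine Score falls inside the (assumed)
    reasonable range for the chosen Broad Category."""
    low, high = REASONABLE_RANGES[broad_category]
    return low <= fine_score <= high
```

The overlap between adjacent ranges reflects the expected grader-to-grader variation: two evaluators can reasonably map the same Fine Score to different Broad Categories.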
Evalutron 6000 design and use details follow below and should clarify for you the use of the Broad Category and Fine Score input systems.
Clarification
A recent question about the Evalutron scoring system prompted the following response from me. In case others were confused, I think it best to share what I said with the rest of the grading community.
My response:
The "Broad Category" and the "Fine Score" are meant to express the *same* grade only in different ways. Think of exams you might have taken: Some professors might give an A or B or F (i.e., Broad Category). On the same exam, another professor might want to express the same grade as 99%, 78% or 20% (i.e., Fine Score).
We need the graders to be "two professors in one". We are asking each grader to be a "broad grading" professor and also a "fine grading" professor. On our fine scale, 0 is meant to represent complete failure and 10 a perfectly similar success. With the set of paired "Broad" and "Fine" scores we can do different kinds of post-Evalutron analyses with each kind of score. Also, we want to be able to take both kinds of scores and analyze them to see what kind of fine scores are associated with each broad score.
Coming and Going: Avoiding Grader Fatigue
Listening to, and then comparing, hundreds of audio files is very tiring. We have built into the back-end of the Evalutron 6000 system a rather robust database system that records in near-real-time your grading scores. These scores are saved along with information about which queries and candidates you have yet to review. All this information is stored in association with your personal sign-in ID. This means you can break up your grading over several days at times convenient and productive to you. In fact, we recommend that you not try to tackle your "assignment" in one big chunk as fresh ears are happy ears and happy ears make for better evaluations.
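The resumable back-end described above amounts to recording each grade as soon as it is entered, keyed by the grader's sign-in ID. The sketch below is an assumption about how such a store could work (SQLite here stands in for whatever database the real system uses; the table name and columns are invented for illustration):

```python
import sqlite3

def save_grade(db_path, grader_id, query_id, candidate_id, broad, fine):
    """Record one grading decision immediately, keyed by sign-in ID.

    Using (grader_id, query_id, candidate_id) as the primary key means
    re-grading a candidate simply replaces the earlier score, and any
    (query, candidate) pair absent from the table is still "to do" --
    which is what lets a grader stop and resume across several days.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS grades (
               grader_id TEXT, query_id TEXT, candidate_id TEXT,
               broad TEXT, fine REAL,
               PRIMARY KEY (grader_id, query_id, candidate_id))"""
    )
    con.execute(
        "INSERT OR REPLACE INTO grades VALUES (?, ?, ?, ?, ?)",
        (grader_id, query_id, candidate_id, broad, fine),
    )
    con.commit()
    con.close()
```

Writing each score as it is entered (rather than batching at the end of a session) is what makes the "near-real-time" guarantee possible: closing the browser mid-session loses nothing already graded.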