Issue 39: Judging the Judges

Why should we judge the Judges? Competition is done, scores are posted, who cares?

The first time I heard this question was after my first “attempt” as a shadow judge at an international competition in Le Touquet, France in 1991. I wasn’t sure judging would ever be something for me, and I wanted to know how one could judge judging quality. After judging in some 65 international events (including 5 World cups, 7 Euro cups, 3 French, 1 German, numerous Dutch nationals) I still wonder, and every now and then I hear remarks about “good” or “bad” judges, so it seems people must still be judging the judges.

Regardless of how it is done, competitors and others seem to strongly believe it does make a difference to competition if judges are “good”.

If the goal of competition is to find the best flier, then a good judge should pick the best flight routine, following the rules and criteria as set, and score the others relative to that number one. If she or he is not able to do that, “competition” is no longer competition.

Defining “good judging” is not easy though, and some might say it is probably impossible.

Which “activities” of the judges could (or should) be checked?

The first might be knowledge: of the rules and guidelines, and of flying a kite, or all kites. We could test knowledge of the rules and guidelines with a kind of exam. It would do no harm to try, but I am not sure it would make a big difference. On the field, the rules play a more important role in establishing a format for the results than in actually judging the figures or routines.

Second, we can check a judge’s knowledge about kite-flying, which is of course essential. We might use videos to check if the judge sees what “we all see”. Personally I like the training sessions we have had in the Netherlands every now and then, where I probably learn more from the pilots than they learn from me as a judge. Discussing the differences between what I see and know, and what they fly or claim to fly, is good and much-needed feedback. We may assume the flier knows enough about flying; that does not always mean, though, that the different talent for judging is there too.

Experience is not too difficult to check. But the fact that someone has judged often in the past might mean they are good at it, or merely that they are a popular person.

Whether a judge can explain his or her conclusions during debriefings is another activity we might look into, kind of like checking their “bedside manner”. To me this feedback to and from fliers is the most important part of a judge’s activities. Explaining (I definitely do not mean arguing!) your views and opinions to other judges enables you to learn. Talking with competitors, or answering their questions, can give you a good idea of what the flier wanted to show you, which might be quite different from what you saw (and the flier will appreciate this feedback from a trained observer, especially if it comes from a ‘good’ judge). It is a less numerical way to address the judges’ quality, and usually far more informative to both fliers and judges than the “why did I score …” question. Judging (de-)briefings are an essential part of judging a competition; general debriefings (and the discussions with pilots just after) are just as essential for competition in general.

The last “activity”, the scoring, is the most obvious one to check and the most questioned, but also the most difficult part of judging to interpret. Obvious, because it is the only visible “result” of the judging process. Difficult, because the numbers alone lose a great deal of meaning without the “attached” flier and judge.

It might be good to analyze these numbers… To do so properly you would actually have to judge in the same competition yourself, but then you would also double the problem.

As in real life (I know, for some of you kite flying is real life), the judges themselves might have the best opportunity to judge the judges. Sure, comparing your own conclusions about your flying with those of the judge (“why is my score so low?”) can give you some idea of the judging, but only comparison with the other judges’ scores really shows the value of ‘your’ score.

So that is why I (as a judge/scorer in a competition) tend to combine all scores and analyze what has happened with the scoring. I do not presume scores are objective, or that judges will always score the same routine with the same number. Scores are just as much an opinion as a conclusion, and so they will always differ between judges.

But judges are asked in the rules to be objective, and judges usually try to be as objective as possible. If they really were objective, though, just one judge would suffice.

The first thing I do is to calculate the average scoring of each judge, and so be able to compare a “low” or “high” score with that average. It might show that a “low” score of one judge might get you a higher ranking than the “high” score of another!
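As a minimal sketch of this first step, the Python below computes each judge's average and re-expresses every score as its distance from that judge's own average. The judge names, flier names and scores are made up for illustration, and the function names are hypothetical; this is not the spreadsheet mentioned at the end of this column.

```python
from statistics import mean

# Hypothetical scores (not from any real event): judge -> {flier: score}
scores = {
    "judge_A": {"flier_1": 62.0, "flier_2": 58.0, "flier_3": 54.0},
    "judge_B": {"flier_1": 80.0, "flier_2": 84.0, "flier_3": 88.0},
}

def judge_averages(scores):
    """Return each judge's mean score over all fliers."""
    return {judge: mean(by_flier.values()) for judge, by_flier in scores.items()}

def relative_scores(scores):
    """Express every score as its distance from that judge's own average."""
    averages = judge_averages(scores)
    return {
        judge: {flier: s - averages[judge] for flier, s in by_flier.items()}
        for judge, by_flier in scores.items()
    }

if __name__ == "__main__":
    print(judge_averages(scores))
    print(relative_scores(scores))
```

With these invented numbers, the 62 that judge_A gives flier_1 lies above that judge's average, while the 80 from judge_B lies below judge_B's average: the "low" number is the better result.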

I check the “spread” in scores for each judge to find which judge stays a bit in the middle, and which one is more extreme in her or his scoring.
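One possible way to put a number on that "spread" is sketched below in Python, using the range (highest minus lowest score) and the standard deviation. Both measures are my assumption of what "spread" could mean here; the scores and the function name are invented for illustration.

```python
from statistics import pstdev

# Hypothetical scores (not from any real event): judge -> list of scores given
scores = {
    "judge_A": [50.0, 60.0, 70.0],
    "judge_B": [58.0, 60.0, 62.0],
}

def spread(values):
    """Return two simple measures of how far a judge's scores fan out."""
    return {
        "range": max(values) - min(values),  # highest minus lowest score
        "std_dev": pstdev(values),           # population standard deviation
    }

if __name__ == "__main__":
    for judge, values in scores.items():
        print(judge, spread(values))
```

In this made-up panel both judges have the same average, but judge_A uses a much wider part of the scale than judge_B.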

I then look for the differences in ranking for each judge, since finding the best, and second and third best, is the more interesting part of judging for the fliers!
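A small Python sketch of such a ranking comparison follows. It turns each judge's scores into placings and reports, per flier, the gap between the best and worst placing received. The data and function names are hypothetical, and ties are simply left in input order, which is an assumption rather than any official rule.

```python
# Hypothetical scores (not from any real event): judge -> {flier: score}
scores = {
    "judge_A": {"flier_1": 62.0, "flier_2": 58.0, "flier_3": 54.0},
    "judge_B": {"flier_1": 80.0, "flier_2": 88.0, "flier_3": 84.0},
}

def ranking(by_flier):
    """Rank fliers from 1 (highest score) downward; ties keep input order."""
    ordered = sorted(by_flier, key=by_flier.get, reverse=True)
    return {flier: place for place, flier in enumerate(ordered, start=1)}

def rank_disagreement(scores):
    """Per flier: worst placing minus best placing across all judges."""
    per_judge = {judge: ranking(by_flier) for judge, by_flier in scores.items()}
    fliers = next(iter(scores.values())).keys()
    return {
        flier: max(r[flier] for r in per_judge.values())
        - min(r[flier] for r in per_judge.values())
        for flier in fliers
    }

if __name__ == "__main__":
    print(rank_disagreement(scores))
```

Here judge_A places flier_1 first and judge_B places the same flier last, so flier_1 shows the largest gap and would be the first routine to discuss.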

Combined, this information gives me some insight into how well the judges agree: which routines and figures might give cause for discussion, and which routines and figures are appreciated roughly the same by all.

It is in the points on which judges seriously disagree that we can find what troubled the judges, how well they succeeded in striving for objectivity, and even which elements in routines, like some new tricks, haven’t found their place yet.

Most interesting is the analysis of the scores for compulsory figures. Short, simple (for the judges) and well described, they should result in very similar scores from a panel of “objective” judges.

Flying is not done to please the judges. Judges deliver a service to fliers: to establish who is the best competitor in each discipline. Of course the “tools” must make that possible. The agreement between fliers and judges (the rules and guidelines) must have a form and content that allows judges to work with them. To give an example, doing something totally new (and difficult) will no doubt impress other fliers, but having “originality” in the rules as a criterion might eventually have more to do with the knowledge of the judge than with the ability of the fliers.

The way compulsory figures are defined is another example. The definitions of figures, and the figures themselves, have changed over the years, and it seems not for the better. I think this must be so because over the last years (I kept track for 13 years) the differences in scoring of compulsories have grown steadily and considerably. To compare two comparable events: the World cup in Long Beach, USA in 1998 showed a maximum difference of 20 points (one compulsory by one team, two different judges); in the team event in Berck, France this year (2004) it was 48 (and it was just as bad at the Euro cup this year). Of course, part of the problem is the diminishing time spent on actually discussing compulsories and rules amongst judges at big events: from more than 20 hours in Guadeloupe and about 15 in Long Beach to barely 3 hours at the Euro cup.
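The "maximum difference" used in that comparison can be computed as sketched below in Python: for every compulsory flown by every team, take the highest minus the lowest score on the panel, then pick the largest of those gaps. All team names, figure names and scores are invented, and the function name is hypothetical; these are not the Long Beach or Berck scores.

```python
# Hypothetical panel scores (not from any real event):
# (team, compulsory figure) -> {judge: score}
panel_scores = {
    ("team_X", "figure_1"): {"judge_A": 70.0, "judge_B": 64.0, "judge_C": 75.0},
    ("team_X", "figure_2"): {"judge_A": 55.0, "judge_B": 80.0, "judge_C": 61.0},
    ("team_Y", "figure_1"): {"judge_A": 82.0, "judge_B": 78.0, "judge_C": 80.0},
}

def largest_gap(panel_scores):
    """Return the (team, figure) with the biggest judge-to-judge gap, and that gap."""
    gaps = {
        key: max(by_judge.values()) - min(by_judge.values())
        for key, by_judge in panel_scores.items()
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]

if __name__ == "__main__":
    print(largest_gap(panel_scores))
```

Tracking this one number from event to event is enough to show whether the panel's agreement on compulsories is improving or getting worse.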

Monitoring the quality of judging might not be so interesting for competitors. The competitors may just need to trust that the judges will declare the best flier as number one. Other, older, judged sports show that when that trust is lost, establishing the actual quality of judging is difficult, certainly if it has not been seriously done before. Judging kite acrobatics is about as difficult as it can get (in team ballet: full ‘3D’, 3 or more kites, five minutes totally free, no prescribed structure or format, no previous knowledge), but the end result of that judging is simply a list of competitors, with the best performance on top.

When I started this analysis in 1994, the main reason to do so was to assure other judges that their fear of having given their friends or fellow countrymen an unfair advantage was unjustified. In almost all cases, judges are too strict with the people or routines they know very well, and only a very few actually show any bias. Over the last years I have seen a gradual change in this (in Europe). The lack of exchange between countries (a lack of international competition) has not given judges enough opportunity to compare different styles and ideas, and more and more they start to see differences in style and ideas as differences in quality.

Judges (both the flier/judges and the ones who “just” judge) should keep an eye on each other, outside the field of course, to maintain and improve the quality of judging. Analysis of scores, good judges’ meetings and debriefings, and talking to competitors will help. Maybe even flying a kite every now and then might help.

Best winds,

Hans Jansen op de Haar

P.s. – For those interested in the ‘statistical’ analysis of scores, a spreadsheet with explanations, containing the public scores of the Berck 2004 event as an example, is available; just drop me a message by clicking on my name above.

P.p.s – I have been in kite acrobatics since 1988, first as team pilot (Dike Hoppers), and since 1991 as judge. I have judged thousands of routines and even a lot more figures. Over the last 35 years I have been interested in cognitive and design-processes, intuitive reasoning, and artificial intelligence. In my former job as building cost engineer (the actual Dutch profession title and position are hard to translate) being able to analyze numbers is essential. The main drive to put these thoughts toward improving judging and competition is, of course, friendship!