SHAKSPER 2007: Shakespeare Golden Ear Test
From: Hardy M. Cook (editor@SHAKSPER.NET) Date: 08/01/07
The Shakespeare Conference: SHK 18.0492 Wednesday, 1 August 2007

From: Ward Ward <ward.elliott@claremontmckenna.edu>
Date: Wednesday, 01 Aug 2007 00:24:25 -0700
Subject: 18.0473 Shakespeare Golden Ear Test
Comment: RE: SHK 18.0473 Shakespeare Golden Ear Test

Our warmest thanks to the 80 brave SHAKSPERians who took our Golden Ear Test, Round 1, and to Hardy for making them all reachable by us. We hope that SHAKSPERians will indulge us in an oversize 20-page posting, giving a preliminary analysis of the results of the test, which test ran for ten days, from July 6 through July 15. It's now OK to discuss the test online, as well as off, but try not to give away too many of the test's specifics, to preserve some usability for it in the future. The test is still up, http://goldenear.cmc.edu, but we haven't been monitoring it since July 15 and would suppose that any results since then are likely to be retakes to get another look at the test.

Our two-sentence conclusion is this: As individuals, on average, the whole SHAKSPER group got almost two out of three unrecognized passages right (63%), and the top 30% got almost three out of four right (74%). As an aggregated group, the whole group got almost four out of five right (79%), and the top 30% got almost five out of six right (82%). For a two-paragraph conclusion, scroll down to Table II, and then to Section XI, Conclusions, and don't miss the cautions and further thanks at the very end. For 20 pages of further detail, read on.

I. Who took the test?

31 of the 80 takers considered themselves Shakespeare pros (39%) and 49 considered themselves amateurs. We would guess that almost all were from SHAKSPER, though there was also an admixture of takers from HLAS, a rowdy, free-wheeling sister group of Shakespeare authorship buffs, most of whom also belong to SHAKSPER but are less likely to be pros than SHAKSPERians and, on their own turf, are subject to none of Hardy's constraints.
It's possible that a few of these do not belong to SHAKSPER, but we don't think it matters much. HLAS people are scarcely less hooked on Shakespeare, and, hence, almost as high on our list of people who might take a test like ours seriously and give us interesting results. 26 respondents described themselves as critics, 14 as writers, 33 as artists, including performing artists, and 8 as "other humanities [than literature], or social, or natural sciences." 80 is about 6% of SHAKSPER's membership, a respectable showing.

To encourage wider participation, we did not require test takers to give their names, and most did not, but 14 takers (18%) did identify themselves as willing, and, in some cases, eager, to take our Round 2 test. 12 of these were Rated, that is, they scored Bronze or better on the test. We did not encounter many, if any, A-List Shakespeare celebrities, no Harold Blooms or Stephen Greenblatts among the self-identifiers, nor many, if any, of SHAKSPER's most vocal past advocates of shutting down your computers and listening to your intuition only -- but we can't exclude the possibility that some Shakespeare grandees or intuitionists might have taken the test anonymously. Some A-list people helped us design the test; we would not expect any of these to have taken it.

II. Gross Accuracy as Found: Two out of three for the group, four out of five for the top 30%

Of the 80 who took the test, 24 (30%) were rated Bronze or better, with the following distribution:

Table I. Rated Takers by Category, All Ratings Gross

                Golden   Silver   Bronze   Total
Identified         4        4        4       12
Unidentified       3        2        7       12
Sh. Pro            6        1        5       12
Sh. Amateur        1        5        6       12
Total              7        6       11       24

Rated players got four out of five identifications right, in gross figures, three out of four right in net figures, and did better on non-Shakespeare than on Shakespeare.
Gross figures count all correct answers, whether recognized from memory or detected by intuition; net figures subtract the recognized answers and count intuitive answers only. Net figures are the interesting ones, but they are harder to arrive at than gross and are not available for all purposes. Of the rated takers, neatly, half considered themselves Shakespeare pros, the other half amateurs. Half identified themselves, half did not. All of the ratings are slightly inflated because they are based on gross, not net accuracy. Subtracting recognized passages would reduce most Golden Ears to Silver, and most Silver to Bronze. The whole group, on average, got about two out of three identifications right, both in gross figures and in net, since the whole group recognized fewer passages, on average, than the Rated players. Like the Rated players, the whole group did better on non-Shakespeare than on Shakespeare. The average gross score of all 80 takers was 18.6 of 28 (66%); their net score equivalent would be about a point lower, 17.6 (63%). The 31 pros in the whole group scored between 14 and 25, averaging 18.9 right of 28 questions (68%, gross). The 49 amateurs scored between 14 and 24, averaging 18.4 right (66%, gross). It is not surprising that the pros did better, on average, than the amateurs. It is surprising that the gap is so small, especially considering that these are gross scores uncorrected for passages recognized by the test-taker, which one would expect to be more common among pros than among amateurs. In fact, the average gross accuracy scores of every subgroup - critics, writers, artists, others -- we tested fell into an extremely narrow range, none lower than 18, none as high as 19 (Table X, below). Should the two-out-of-three or three-out-of-four individual gross accuracy levels we found be considered high or low? 
Judging from post-test lamentations we have heard, scattered mutterings about "humbling experience" and "stupid test," and low self-identification rates even of high-scoring players, we would guess that many, perhaps most, takers were disappointed with their scores, expected to do better, and didn't want to put their names to their test results. Our further guess is that they reacted to our test much as law students were expected to do to an experiment/demonstration routinely performed in Evidence classes in our day, decades ago. A two-minute dramatic event is staged, and the students are asked to describe what happened. Who was chasing whom? What did they say? What color were the first one's eyes? How tall was the second one, and what did he weigh? What color were his socks? The answers, when collected, always turn out to be all over the lot and filled with inaccuracies, and the class is supposed to learn from the "humbling experience" that eyewitness testimony is not very reliable.

We followed a different model in our Golden Ear experiment (below). Our expectations of individual performances were not so high, and we believe that SHAKSPERians have done quite a bit better as a group than most of them think. The average individual outcomes as found struck us as par for the course or better; and those for rated players seem particularly impressive. Moreover, we now have further confirmation that the scores as found can be tweaked by screening and aggregation to reach surprisingly high levels of group accuracy, which have never, to our knowledge, been demonstrated before on this scale.

Let's start with the outcomes as found. Two-out-of-three average accuracy is far from perfect, but it is better than chance, better on non-Shakespeare than most of our individual computer tests on sizable texts (though not better than all of them combined), and better than all our computer tests combined on the very short, sonnet-length passages we used in the test.
It is also, we would guess, better than three of our four past pilot-study groups mostly of Claremont Colleges students (below). The mean for the SHAKSPER group is barely a point and a half short of a Bronze. For reference, our preset range boundaries were: Golden Ear, 24-28 out of 28; Silver, 22-23; Bronze, 20-21. No one got a tin ear, 12 or less, because chance tends to pull all scores, high and low, toward the mean. If you don't recognize anything at all and guess at random, you still have 50-50 odds of getting each guess right, and you will much more likely get a 15 or a 16 than a zero. Getting a zero would be a remarkable feat, implying powers of discrimination comparable to what is needed to max the test with a 28. With a 35% average failure rate, about what SHAKSPER's was, you would expect 1.1 Tin Ear (below 12) purely by chance; we got none. You would also expect 1.1 Golden Ear (24+) purely by chance; we got seven, and would conclude that their success has to be more than pure luck. Of the 24 rated players, Bronze or better, the pro gross average was 22.7, amateur average, 21.3, all rated combined, 22.0. These amount to 81%, 76%, and 79% gross accuracy, respectively. Stated differently, the average rated pro player had a Silver Ear and could correctly ascribe four out of five passages; the average rated amateur could get three out of four - some of which in both cases, however, were from memory, not from intuitive detection. Again, we know of no computer test, or combination of computer tests, which could reach this level of accuracy in identifying such short passages. III. Gross accuracy tweaked upwards by aggregation and screening. Now let's get to the tweaking. We knew from our pilot study of a dozen Claremont students in a 2002 Shakespeare class that you could bring out a group's latent accuracy by screening for the best and averaging for the group. 
The best student in the group got 79% right; the top half of the group, horizontally aggregated by majority rule for each question, got 85% right; and the whole class, so aggregated, got 89% right. These are both gross and net figures because student recognition of the passages was close to zero. Could we do the same with a larger, more experienced panel of SHAKSPERians? The short answer is not quite the students' remarkable level of nine out of ten, but still an impressive five out of six, net, after removing all recognized passages. In our day the classic demonstration of the benefits of aggregation, since revealingly elaborated by James Surowiecki in The Wisdom of Crowds, 2004, was performed routinely in business school classes. The class would be asked to guess the number of beans in a jar, and, despite their best efforts to get it right, would find that their answers were all over the lot, just like the law students. But, far from dismissing it as another "humbling experience," the professor would go on to plot the guesses. If the class took the task seriously, as they generally do in business schools, the usual outcome was this: the guesses would form a bell curve, with the peak of the curve within five or ten percent of the actual number of beans in the jar. Thanks to the Wisdom of Crowds, the class as a whole would often equal or surpass the accuracy of its best member, and with far greater reliability and predictability, since no one can tell in advance who is going to make the best guess. Chance is much more of a factor with one considered guess than with many. Our adaptation of the Wisdom of Crowds to the Golden Ear test was not just to average takers' individual scores, but also to average their answers to each question and draw from these "horizontal" averages a collective group score. 
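The "horizontal" aggregation just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not our actual scoring procedure: the function names, the toy three-taker answer grid, and the rule that a tie counts as "not Shakespeare" are all hypothetical.

```python
from typing import List


def individual_average(answers: List[List[bool]], key: List[bool]) -> float:
    """'Vertical' average: mean share of questions each taker got right."""
    right = sum(a == k for row in answers for a, k in zip(row, key))
    return right / (len(answers) * len(key))


def aggregated_score(answers: List[List[bool]], key: List[bool]) -> float:
    """'Horizontal' aggregation: take the majority answer on each question,
    then score that single collective answer sheet like one test."""
    n_takers = len(answers)
    correct = 0
    for q, truth in enumerate(key):
        yes_votes = sum(row[q] for row in answers)
        # Assumption: an exact tie is scored as "not Shakespeare".
        majority_says_shakespeare = yes_votes * 2 > n_takers
        if majority_says_shakespeare == truth:
            correct += 1
    return correct / len(key)


# Toy data (hypothetical): True means "Shakespeare".
key = [True, False, True, False]
answers = [
    [True, False, False, False],  # taker A: 3 of 4 right
    [True, True, True, False],    # taker B: 3 of 4 right
    [False, False, True, True],   # taker C: 2 of 4 right
]

vertical = individual_average(answers, key)
horizontal = aggregated_score(answers, key)
```

In this toy grid no taker is perfect, yet each one errs on a different question, so the majority answer is right every time. That is the effect the aggregation relies on; it weakens when takers tend to miss the same questions.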
For the SHAKSPER takers, the two tweaks combined raised gross group accuracy dramatically, from an untweaked individual average accuracy of 66% to a tweaked collective average accuracy of almost 93%. Here's how we did it: We have seen that the gross vertical average of final individual scores of all takers was 18.6, with 66% correct answers. The collective gross score of all takers, first finding the group's majority horizontally on each question and then scoring the whole group's averaged answers vertically, just like a single test, was 22, or 79% correct answers. Simply aggregating the whole group's guesses on each question raised their gross accuracy from two out of three to four out of five.

As it happens, the gross average of individual scores of all Rated (that is, Bronze or better) takers was also 22, exactly equal to the four-out-of-five accuracy of the whole group. Aggregation raised the whole group's gross accuracy 13% (i.e., 79% - 66%) to the same level as its top 30% unaggregated. Aggregating the top 30%, in turn, raised its gross accuracy another 13.9%, to 26 out of 28, 92.9% accuracy, one point better, as it happens, than the anonymous top-scoring individual. As it further happens, gross aggregated accuracy was identical for Shakespeare and for non-Shakespeare. Unaggregated, the top players got a gross average of four out of five right; aggregated they got 13 out of 14 right. Both of the tweaks combined raised the group's gross collective score from 18.6 to an astonishing 26, and its gross collective accuracy from 66% to almost 93%, and it seems that both tweaks contributed about equally to the improvement. After these two tweaks, the experience might seem not quite so humbling.

IV. Net Accuracy: After removing recognized passages, still tweakable to almost Five out of Six.

But we need at least one further tweak, and this one brings the average back down a bit, to four out of five for the group, and five out of six for the Top-30% Rated players.
All the gross numbers discussed so far make no allowance for recognition and treat every correct answer as if it came from intuition. Unless recognition is zero, as it almost was for our student pilot panels, but surely was not for our SHAKSPER panel, this is bound to overstate the power of intuition. We tried to avoid familiar passages, identify them, and exclude them where found. "Avoid" means that we tried to pick the least familiar passages, especially Shakespeare passages, we thought we could find. Jim Carroll's complaint (8 July) that our "Shakespearean passages have been chosen to minimize what is distinctive about Shakespeare: his interesting diction and his constant stretching for metaphorical expression" probably reflects this precaution, but we thought it far preferable to exclude archetypical passages like "Friends, Romans, countrymen" or "Shall I compare thee to a summer's day?" from a test of intuitive detection than to include them and either miscount them as true detection or toss them out as obviously remembered, in which case, you would have to wonder why we bothered to include them in the first place. We would guess that Shakespeare's most archetypical passages are also his most familiar, and they are more likely to be tests of memory than of intuition. If we succeeded in choosing unfamiliar passages, the ones we chose are no doubt less distinctive than passages randomly selected from Shakespeare -- but the alternatives, choosing familiar passages, or even choosing at random without regard to familiarity, would have been too big a waste of our takers' time and our own. Steering clear of the most familiar passages alone was enough in our pilot studies to make sure that only one of the students recognized even one passage. The students' net accuracy was therefore essentially the same as their gross accuracy because, for them, almost every passage was a case of first impression. 
However, as we shall see, it wasn't enough for SHAKSPER, especially for Shakespeare passages, and we had to correct for it.

"Identify" means that we asked takers to tell us outright whether they recognized each passage. The responses showed us that, with a group as sophisticated as SHAKSPER, our efforts to avoid familiar Shakespeare passages were not always successful. Our worst choice from this standpoint was a passage from Twelfth Night, which was recognized by 45% of all the takers and 75% of the Rated takers. Two other play passages and one Shakespeare sonnet got 20-30% recognition from the whole group and 40-50% recognition from the Rated players. The other Shakespeare questions averaged maybe 7% recognition for the whole group, 18% for the rated group. The overall average recognition rate was three or four times higher for Shakespeare than for non-Shakespeare, and twice as high for rated players as for the group as a whole. That is, on average, 15% of the whole group and 29% of the rated group recognized our Shakespeare passages, and 4% of the whole group, and 8% of the rated group recognized the non-Shakespeare passages. By the same token, however, they did not recognize 70-85% of our Shakespeare passages, and 92-96% of our non-Shakespeare passages, making these fully and properly testable by our methods.

"Exclude" means we tried, using simplifying assumptions, to find a way to exclude recognized passages from our accuracy estimates. To get from gross accuracy, which is inflated by recognition, to net accuracy, which is not, we assumed that essentially all recognition identifications would be correct and subtracted them both from the group's total correct answers and from its total valid takes. Spot-checking the top two categories, who got a quarter of the recognitions, indicates that the assumption is not quite true, only 96% true. Four percent of their supposed recognitions were wrong.
But the actuality is close enough to 100% to allow us to use it as a simplifying assumption, which would very slightly overstate net accuracy. It is preferable to calculating the impact of each recognition separately and manually for 2,240 answers, and far preferable to using gross accuracy only. This exclusion process lowered the average of individual percentages by one to eight percent, with the greatest reductions for Shakespeare and rated players, where recognition was very high, and the least for non-Shakespeare and the whole group, where recognition was lower. Overall reductions from gross averaged individual accuracy to net were 6% for the rated group and 4% for the whole group. Here is a global summary of all our averages, with averaged individual accuracy percentages above and aggregated group accuracy below:

Table II. Average Gross and Net Accuracy Rates, Individual and Group

                     All, gross   All, net   Rated, gross   Rated, net
Shakespeare             66%          60%         76%            66%
Non-Shakespeare         66%          66%         81%            80%
All                     67%          63%         79%            74%
Aggregated (Group)      79%          79%         93%            82%

This is our most important, bottom-line table, which gives the vertical average of individual scores (that is, the sum of all correct answers divided by the sum of all unrecognized takes), above, and the horizontally, majority-rule-for-each question aggregated group score, below, gross figures to the left, net to the right. The key figures are now the net ones. What leaps out from it to our eye is (1) that netting out the recognized answers, unsurprisingly, cuts the Shakespeare accuracy percentages much more than the non-Shakespeare; (2) it narrows the gap between the Rated players' averaged individual scores and those of the whole group slightly, from 12 points to 11 points, thanks mostly to lower net Shakespeare recognition, and (3) surprisingly, it cuts the gap between the two groups' aggregated group scores from 14 percentage points to only three.
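The gross-to-net step described before Table II can be written out as a short Python sketch. It rests on the simplifying assumption stated there, that every recognized passage was answered correctly; the function names and the sample numbers are hypothetical illustrations, not figures from our data.

```python
def gross_accuracy(correct: int, takes: int) -> float:
    """Share of all answers that were right, recognized or not."""
    return correct / takes


def net_accuracy(correct: int, takes: int, recognized: int) -> float:
    """Share of unrecognized answers that were right.

    Assumes every recognized answer was correct, so the recognized count
    is subtracted from both the correct answers and the total takes.
    """
    if recognized > correct or recognized >= takes:
        raise ValueError("recognized count is inconsistent with the totals")
    return (correct - recognized) / (takes - recognized)


# Hypothetical numbers: 100 answers, 70 right, 10 of them recognized.
example_gross = gross_accuracy(70, 100)
example_net = net_accuracy(70, 100, 10)
```

Because the recognized answers are assumed to be all correct, removing them can only lower the percentage or leave it unchanged. If some recognitions were in fact wrong, as our spot check suggests for about four percent, the true net figure would be slightly lower than this calculation gives.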
Netting for recognition made no difference at all for the whole group's aggregated accuracy score of 79%, but it cut the Rated group's aggregated accuracy score from a dizzying 93% to 82% -- only slightly higher than the whole group's despite the rated group's much higher individual accuracy. I haven't fully worked this out with Valenza, but he thinks the basic forces at work are mathematical. For a fixed error rate the majority decision gets better as you increase the size of the voting population, and the convergence benefits should be less pronounced with a smaller population, even if it is more skilled, especially if you have already squeezed out most of the group's latent accuracy by aggregation. If you are innocent, he notes, you would be better off in principle with a jury of four 60%-accurate jurors and a 1% chance of conviction than with a jury of two 80%-accurate jurors and a 4% chance of conviction. In practice, majority rule, different levels of skill among the test-takers, and different levels of difficulty among the passages complicate this. Again Valenza: "If students vote on true/false answers on a math test, and certain questions are out of the reach of all, aggregation won't help at all on such questions, so their aggregate hit rate will converge on the number of easy questions and stick there." See Section VI below for some striking examples of convergence among SHAKSPER respondents, both in getting right many passages that now look easy, and in getting wrong a few passages that now look hard.

Whatever this says about the value of screening, it suggests that aggregation can still boost a group's net accuracy significantly, 16 percentage points for the whole group, six for the rated group, and that aggregated group accuracy is almost four out of five for the whole group and almost five out of six for the rated group.

V. Other discounts: Honor system, replicability, choice of samples, and sample size.
There were several important differences between our SHAKSPER panel and our prior student pilot panels. Though most of the students had read or seen several of Shakespeare's high-school favorites, like Julius Caesar, their recognition of our supposedly obscure passages was much, much lower. So were their stakes in the outcome, their eagerness to take the test, their expectations of their own performance, and their overall Shakespeare investments. They had nothing to lose from a low score, no incentive to pump up their scores, and little opportunity to do so either, since they all took the same test on paper at more or less the same time in more or less the same place and didn't get the answers till the tests were all in.

SHAKSPER was a different matter. Its members are heavily invested in Shakespeare, often with a conspicuous attachment to one side or the other of a hot debate. Many of them trust much more to their intuition than to our stylometrics. Many are pros; a few are A-list Shakespeare celebrities. They are highly knowledgeable and perfectionist. They have hard-earned reputations (or at least hopes of them) which they can trade on and don't want to jeopardize. For many issues, especially abstract, symbolic ones or obscure, complicated, technical ones like Stylometry, it is often reputations, more than evidence, that seem to carry the day. One-upmanship, though less blatant and all-consuming than on HLAS, is still very much the coin of the realm on SHAKSPER. Many members have a stake in the outcome and much to lose from getting a known low score. This means that the incentives not to take the test or, taking it, not to rest content with a low score -- far less to let the results be bruited around -- are much stronger than they were for our students. It's hardly surprising that Shakespeare A-list grandees did not hasten to put their names to our test; they had much to lose and little to gain from it.
And a web-based test like ours, which doesn't take your name before the test, and which tells you the correct answers afterward, is easier to take advantage of than a names-taken paper test of the same people in the same room at the same time. Anyone who offers such a test to an audience like SHAKSPER has to deal with tradeoffs between what you have to do to get people to take the test at all and what you have to do to keep them from giving biased or inflated results. Many social scientists would have wanted us to build in hard controls on bias and inflation: strictly randomize the takers; have a control group; don't tell anyone the answers, make sure they can't easily copy or Google the test; give everybody a name or a code and put cookies in their computers to make sure they can't take it twice; or, best of all, tell the ones from Canada and the UK not to take the test and make the others all come and get tested in the same room at the same time with a timer and a monitor present, just like the College Board, which has excellent reasons to take such precautions for a high-stakes test. Most of these hard controls seem to us inappropriate for a group like SHAKSPER, too off-putting, too impractical, too pointless, or too easy to get around. We chose soft controls. We tried to make the test as inviting, non-threatening, non-onerous, and rewarding as we could, short, net-based, and with as much anonymity and feedback as anyone could want. We tried to keep the perceived stakes as low as we could. We limited the experiment to ten days. We asked people not to retake the test or discuss the questions online while the test was going on. In short, we relied heavily on the honor system and soft controls to keep the test one of first impression. We think we did the right thing, and believe that test abuse was close to zero. We found only two obvious retakes, both innocent, and both later self-identified for us by the takers. 
The rest (admittedly a partly qualitative judgment) look legitimate to us. If there are a few fudged ones, it is extremely unlikely in a test with this many takers that they would change the outcome by more than a percent or so, or, perhaps more important, that the change, if any, would overstate the group's accuracy. None of the comments we have seen so far are concerned in the least with overstating the group's accuracy; the overwhelming concern has been with understatement. Three such criticisms, suggested by Jim Carroll, Bob Grumman, and a couple of offline correspondents are that the samples are too short (Carroll and the correspondents), the test too long (Grumman), and the Shakespeare passages insufficiently distinctive (Carroll). We have previously discussed the last point. If you are testing for Shakespeare detection, and not for Shakespeare recall, you want passages that not everyone knows, and it would hardly be surprising if these were less distinctive than, say, "Lay on, Macduff," or "my kingdom for a horse." As for lengthening the passages, it is possible that longer passages could be easier to identify by intuition, as they surely are by stylometrics. But the costs of using much longer passages seem to us prohibitive. The only practicable choices for a test like this are many passages and short or few passages and long. "Many and long" is not a real option because it would make the test much too long to take, even for SHAKSPERians. Judging from just a few e-mails from takers, our test, with 31 sonnet-length passages, takes about 15-20 minutes to finish with snap judgments, and up to 40 minutes or so if the judgments are more studied. Suppose we tripled or quadrupled our passage length to make it comparable to the shortest passages you can reasonably expect to test with computers, that is, to 500 words instead of 140. 
It's hard to imagine such an expanded test taking less than an hour and easy to imagine, not just Bob Grumman's, but every test taker's patience and, worse, their focus, wearing thin. And would we still hear complaints that the longer passages were also too short? We can't rule it out. It is not wise to make your test a Marathon if you want your takers to take it, finish it, and pay close attention to it all the way through. It's also possible, both with SHAKSPER and with HLAS, to imagine a few dedicated Marathoners whose patience and focus would not wear thin, and who would not just eyeball the text, but would spend whatever time it took to comb through the passages for tell-tale words like "Dunsinane" or "Osric," or, worse, comb it for those deplorable, countable stylometric tell-tales that people like us spend our days looking for, and that intuition is supposed to make superfluous - feminine endings, open lines, contractions, hendiadys, incongruous who's, and the like. Worst of all, they might Google the passage; it wouldn't be hard to do. Whatever you may think or say of such rational, left-brain deployment, it is not intuition, and long, high-stakes passages invite much more of it than short, lower-stakes passages. Though shortening the test, to, say, eight long passages, instead of 31 short ones, might at first glance seem to get the test back to a reasonable length, we would expect it, if anything, to raise the stakes on each passage, and, with it, the temptation to study each passage harder -- and much longer - and, again, supplement or supplant the right-brain intuition we are trying to test with left-brain deployment, which has nothing to do with intuition. What would our most predictable critic, Jim Carroll, say of a test with just three or four long Shakespeare passages? 
That it was just the ticket and welcome evidence of our growing methodological sophistication, or that it, too, had grave methodological shortcomings and was not ready for prime time - with far too few passages for a fair test of Shakespeare or anyone else, yet still was too much of a Marathon to ask of reasonable takers? There are many other serious problems with few-but-long passages - they are less broadly representative and, hence, more subject to variability; they are much more vulnerable to recognition; it's much harder to find ones that aren't; the distortion costs of using one like our unfortunate Twelfth Night passage that everyone turns out to know are much higher; and you would have a lot of explaining to do if you tried to make apples-to-apples comparisons of irreparably short passages like Shall I die? with longer ones -- but we think the ones we have discussed should be enough to make our case. We don't pretend to have said the last word on this subject, and, as always, we invite others to try different tradeoffs than the ones we used, but nothing is either good or bad but alternatives make it so, and neither lengthening the passages nor lengthening the test strikes us as much of an improvement.

VI. Identification Hits.

As with our student panel, most of our SHAKSPER answers to each question, Shakespeare or non-Shakespeare, right or wrong, showed very high intra-group agreement as to whether or not the passage was by Shakespeare, and also showed high agreement between the whole group and the Rated group. No more than 7% of the aggregated answers look like tossups. The other 93-96% show majorities of 57% or more. If numbers like these were reported in a national election, everyone would consider it a landslide (Table III). I call this consensus; Valenza calls it convergence.

Table III. Group Consensus: Very High, but Not Always Correct
All figures net accuracy

                     Shakespeare        Non-Shakespeare     All
High consensus, questions answered correctly
  Full panel         9 (68-80% maj)     11 (59-100% maj)    20 (59-100% maj)
  Rated only         11 (64-88% maj)    12 (57-100% maj)    23 (57-100% maj)
High consensus, questions answered incorrectly
  Full panel         3 (64-67% maj)     3 (57-69% maj)      6 (57-69% maj)
  Rated only         2 (73-78% maj)     2 (57-58% maj)      4 (57-78% maj)
Tossups, all incorrect
  Full panel         2 (51% maj)        0                   2 (51%)
  Rated only         1 (53% maj)        0                   1 (53%)
No tossups were correct

This means that both panels had high consensus on 26 or 27 out of the 28 questions and were closely divided on only one or two. Looking at high-consensus answers only, the full panel got 20 of 26 (77%) firmly right, in gross, and the other six firmly wrong. The Rated panel got 23 of 27 firmly right (85%) and the other four firmly wrong. We'll skip the details of the impressive Shakespeare "firmly rights" and go straight to the equally impressive Non-Shakespeare "firmly rights." Table IV shows that neither panel had much trouble with most non-Shakespeare authors represented.

Table IV. Eleven Non-Shakespeare Hits
Percentages who thought it non-Shakespeare (full panel/rated only). Listed in declining order of Rated percentages. All percentages net.

Passage            Full panel/Rated only
Bacon poems        87/100%
Middleton          89/100%
Chapman            82/100%
Spenser            78/96%
Fletcher           67/91%
Daniel             75/87%
Marlowe II         71/83%
Shall I Die?       65/82%
Earl of Oxford     59/79%
Marlowe I          70/78%
Jonson             60/72%

All of these seem like solid hits to us, both according to what we see as the orthodox consensus and according to what our computer evidence has done to confirm it. None of these tested passages seem likely to be Shakespeare's. Not everyone agrees with us or the orthodox consensus on every passage, but the important point here is that remarkably few of our test-takers thought these passages sounded like Shakespeare.
We would hardly consider numbers like these a humbling outcome for the group that produced them. Shall I Die? was the only one of these widely recognized (by 31%/54% of the two panels), but few of those who did not recognize it thought it was Shakespeare's. Would longer passages greatly enhance these landslides? We doubt it; they are already so lopsided it's hard to imagine longer passages changing things much, even if they should be easier to identify. Do they signal that the group is befuddled by too-short passages? It doesn't look like it.

VII. Identification Misses.

However, two outcomes, though equally convergent and consensual, were not so impressive for accuracy, and a third showed the full group at odds with the Rated group (Table V).

Table V. Two and a half Non-Shakespeare misses

Percentages who thought it non-Shakespeare (Full panel/Rated). All percentages net.

Passage           Full panel/Rated
Oldcastle         31/42%
Drayton           43/43%
Funeral Elegy     41/57%

The Oldcastle and Drayton passages, one recalling a beleaguered-stag scene from As You Like It, the other a sonnet from Drayton's Idea, suggest that even strong majorities of both groups can be fooled by well-turned, vivid, image-rich passages by other writers. It is also possible that Drayton, who Henslowe says co-authored the play Sir John Oldcastle (1600) with Anthony Munday, Robert Wilson, and Richard Hathway, could have written both of the confounding passages. A second edition of Oldcastle ascribed it to Shakespeare, and it was included in the 1664 Folio and Brooke's Apocrypha, but we know of no one today who seriously ascribes it to Shakespeare, and our tests say it's very unlikely to be Shakespeare's work (our 2004, p. 402). No takers, incidentally, recognized the Oldcastle passage and only one recognized the one from Idea.

What about the Funeral Elegy? Donald Foster relied in part on computer tests to prove that the Elegy "couldn't not be Shakespeare," and he did speak of intuitive "sniff tests" with a hint of disdain.
When Brian Vickers' crushing counter case, Counterfeiting Shakespeare (2002), loomed, and Foster abandoned his Shakespeare ascription, the dull, pious, pedestrian Elegy of the eye instantly became Exhibit A for those who say you should always trust your gut instincts, never anyone's computers. If so, and if the whole SHAKSPER group's intuitions were to be taken as the only valid test, Foster should have stuck to his guns on authorship and reconsidered his disdain for sniff tests. Only 6% of the whole panel recognized our Elegy passage, and 59% of those who didn't thought it was Shakespeare's! On the other hand, a net 57% of the Rated Panel thought it was not Shakespeare. Our tests say that Foster did the right thing to concede, and that the Elegy is on a different statistical planet from Shakespeare, though it could easily be by Ford (our 2001). Maybe it's only the best ears we are supposed to listen to, but it's not easy to know in advance which ones those are on any given passage, and they are not always connected to the best mouths. On balance, the conflict between rated ears and all ears does little to make the Elegy look like a reliable success story for detection by gut instinct.

Table VI shows two Shakespeare misses for both panels, and two equivocal tossups.

Table VI. Two Shakespeare misses and two more divided panels

Percentages who thought it Shakespeare (Full panel/Rated). All percentages net.

Passage                 Full panel/Rated
The Rape of Lucrece     38/22%
Pericles Act V          33/47%
Love's Labor's Lost     49/64%
Venus and Adonis        49/59%

Only one taker recognized our passage from The Rape of Lucrece, and very few of the others thought it was Shakespeare's, fewest of all, oddly, on the Rated panel, which was otherwise generally more accurate than the whole group. Seven takers recognized our passage from Pericles, Act V, but two-thirds of the whole panel, and 53% of the Rated panel, thought it was not Shakespeare's.
Pericles is generally considered co-authored by Shakespeare and George Wilkins; scholarly consensus gives Acts 3-5 to Shakespeare, and our tests agree with it. The passages from Love's Labor's Lost and Venus and Adonis were recognized by 19 and three takers, respectively; for the others, the whole group was divided half and half, but clear majorities of the Rated group correctly ascribed them both to Shakespeare. It's not clear how to score these. Two misses and two tossups? Or two misses and two half-hits? Neither seems an unequivocal success story for these identifications.

VIII. Three shots in the dark.

Three passages on the test were not scored, since scholarly consensus as to who wrote them is not settled. But we tested them anyway in case we found the group's instincts helpful in determining actual ascriptions. This is wholly uncharted territory, but, if we had a computer test that looked like it might be 80% accurate, we might not bet a thousand pounds on it, as we have on some of our computer tests, but we certainly would not want to let it sit on the shelf unexplored. The same may be said for ascription by gut instinct. With tweaking, it can reach 82% group net accuracy for passages of known authorship, and we can't imagine SHAKSPERians not being curious as to what it says about passages of unsettled authorship. Table VII gives the outcomes:

Table VII. SHAKSPER's Group Ascriptions for Three Doubtful Passages

Percentages who thought it Shakespeare (Full panel/Rated). All percentages net.

Passage                 Full panel/Rated
1H6                     87/89%
A Lover's Complaint     42/26%
Edward III              68/73%

14% and 25% of the panels recognized another beleaguered stag scene from 1H6, Talbot before Bordeaux. 87/89% of those who did not recognize it thought it was Shakespeare, one of the most lopsided majorities on the test. Gary Taylor assigns the scene, 4.02, to Shakespeare; Paul Vincent thinks it is co-authored by Shakespeare and "Author Y." Marcus Dahl could find no hand but Shakespeare's in the whole play.
Our own tests are ambivalent on the scene as a whole. It looks like a Shakespeare could-be by all our regular tests, but an improbable by one new test. The passage itself is much too short for our tests. We lean toward Vincent's view of the whole scene, but the stylometric evidence is hard to judge - much harder, it seems, than the intuitive evidence. SHAKSPER's judgment in this case is consistent with all four views of 1H6, though one could argue that SHAKSPER's judgment on non-Shakespeare beleaguered-stag passages is not terribly reliable.

Perhaps surprisingly, only one person recognized the passage from A Lover's Complaint. Could it be more discussed these days than read? Of the many who did not recognize it, 58/74% thought it was not Shakespeare. MacDonald Jackson, Kenneth Muir, and most scholars of the late twentieth century have assigned LC to Shakespeare. Our best guess (our 1997 and 2004), and Brian Vickers' (his 2007), and Marina Tarlinskaja's is that it is not. SHAKSPER's judgment favors the doubters, but, again, one could argue that SHAKSPER's group judgments on Shakespeare poems outside the Sonnets don't look like its most reliable.

Five and eight percent of the two panels recognized the Countess Scene passage from Edward III, which we take to be a recent addition to the consensus Canon. Our tests say it could be Shakespeare, and 68/73% of the two panels seem to agree.

IX. How SHAKSPER's Round 1 compares with other takers.

If we were to seek a kind of "control group" to compare with SHAKSPER's takers, we would turn to several Claremont student groups who have taken the test, or a precursor, or we would look to our computer tests. Four student groups have taken this, or a previous test: Valenza's preceptorial class of entering freshmen in 1995; the Claremont Rugby Side on a bus in New Zealand in 2002; Ann Meyer's Class on Shakespeare's Tragedies in 2002; and a sprinkling of Philosophy, Politics, and Economics alumni volunteers in 2007.
We have also, over the years, identified a couple of miscellaneous Golden Ears from casual takers, and we have gone for advice to a small group of pros, most notably MacDonald Jackson and John Farrell, but also Brian Vickers, Lisa Hopkins and Matt Steggle, none of whom should be deemed in any way responsible for any of our test's shortcomings.

None of our tested groups got the extensive recordkeeping and analysis that we have given the SHAKSPER groups, but it is safe to say from memory that SHAKSPER outperformed the preceptorials, the Rugby Side, and the PPE students. Whether it outperformed the Claremont Shakespeare class is not so clear. We no longer have the students' individual scores, nor their individual averages, but they were the least casual, and the most-studied of our pilot groups. Here is an adaptation of a posting we sent to SHAKSPER in 2004:

Table VIII. A Gross-Score Comparison of SHAKSPER with Claremont Pilot Group

                    Claremont Students, 2002      SHAKSPER, 2007
Worst individual    54% correct, gross            50% correct, gross
Best individual     79% correct, gross            89% gross
Best combined       84% correct, gross, n=6       93% correct, gross, n=24
All combined        89% correct, gross, n=12      79% correct, gross, n=80

18 of the student group's 25 successful identifications (72%) were by lopsided votes, two-to-one or higher. 16 of the whole SHAKSPER group's 22 successful identifications (73%) were lopsided in this sense, as were 20 of the Rated group's 24 successful identifications (83%).

A better comparison, since SHAKSPERians, on average, recognized about a tenth of the passages and the students next to none, would be between the students' gross scores, essentially equivalent to their net scores, and SHAKSPER's net scores. Such a comparison would look like Table IX:

Table IX.
Net-Score Comparison of SHAKSPER with Claremont Pilot Group

                    Claremont Students, 2002      SHAKSPER, 2007
Worst individual    54% correct, net              44% correct, net
Best individual     79% correct, net              87% correct, net
Best combined       84% correct, net, n=6         82% correct, net, n=24
All combined        89% correct, net, n=12        79% correct, net, n=80

It would not be surprising if a group of 80 got higher-highest and lower-lowest scores than a group of 12. It's mildly surprising that the smaller, far less knowledgeable, less motivated group got higher group accuracy than the larger, more knowledgeable, more motivated group, but the smaller the group, the more likely its outcome is to be a fluke. Recall, for example, that the Shakespeare class did much better than the other three Claremont student groups tested. Could that have been a fluke?

Perhaps it was also mildly surprising that the whole of the student group, aggregated, did better than the top half, aggregated, which, in turn, did better than the best individual. For the students, aggregation was a bigger boost to accuracy than screening. If so, the reverse SHAKSPER outcome, where the best individual surpassed the best group, which surpassed the whole group, would not be a surprise. For SHAKSPER, screening seems to have been a more powerful tweak than aggregation. There must be a literature on this, perhaps to be gleaned from James Surowiecki's footnotes, but we have not studied it. For now, we would say that the two sets of results are remarkably similar and we would guess, where they differ, that the larger SHAKSPER panels give a better notion than the student panel of what you can reasonably expect to do with intuition and what you cannot.

It would be possible, with more manual counting than we want to do now, to compare the net accuracy of Shakespeare pros with those of amateurs, and to see whether literary critics did any better or worse than stage performers, artists, scientists, and so on.
It's all buried in the data and retrievable in principle, but it would not be easy, and it is not at the top of our agenda. We would guess from the similar accuracy levels of SHAKSPER and the student group, and especially from the surprisingly similar gross accuracy levels of SHAKSPER's own pro and amateur respondents, 68% and 66%, respectively, that the difference in pros' and amateurs' net accuracy, if any, would be barely detectable, and not necessarily favorable to the pros, whom we would expect to recognize more passages than amateurs. Table X also shows a remarkably narrow range of average gross accuracy scores for the various subgroups identified:

Table X. Gross Accuracy Scores of Identified Subgroups

Subgroup         Number    Gross Average Score    Gross Average Accuracy %
Professionals    31        18.9                   67.5
Amateurs         49        18.6                   66.4
Critics          26        18.7                   66.8
Writers          14        18.9                   67.5
Artists          33        18.6                   66.4
Other            21        18.0                   64.3

Some of the categories overlap. "Other" is mostly people who declined to state a category. Would net accuracy differ greatly from these gross accuracy scores? We don't know, but it seems improbable. One might imagine that writers and artists would be more intuitive, and critics more analytical (see Simonton, Origins of Genius, 1999), but the average accuracy of the three categories is virtually identical.

We did not ask anyone to list their IQs or their verbal and math SAT scores, considering it too nosy and off-putting even for us, but we would have loved to have had them, and maybe their college majors as well. Four of our best six students in the 2002 pilot study were science majors; the two best of all were science majors from Harvard-bright Harvey Mudd College; the others were from our college, Claremont McKenna College, whose students, on average, are only Columbia-bright. We have to wonder whether any of our SHAKSPER takers were scientists.
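For readers who want to check the arithmetic behind Table X, here is a minimal sketch. It assumes, as the Table X figures and the 28 scored questions suggest, that each subgroup's accuracy percentage is simply its gross average score divided by 28. The names SCORED_QUESTIONS, TABLE_X, and gross_accuracy_pct are ours for illustration, not part of the test software.

```python
# Sketch only: assumes accuracy % = gross average score / 28 scored questions.
SCORED_QUESTIONS = 28

# (subgroup, number of takers, gross average score), copied from Table X
TABLE_X = [
    ("Professionals", 31, 18.9),
    ("Amateurs", 49, 18.6),
    ("Critics", 26, 18.7),
    ("Writers", 14, 18.9),
    ("Artists", 33, 18.6),
    ("Other", 21, 18.0),
]


def gross_accuracy_pct(avg_score, questions=SCORED_QUESTIONS):
    """Convert an average count of correct answers into a percentage."""
    return round(100 * avg_score / questions, 1)


percentages = {name: gross_accuracy_pct(score) for name, _, score in TABLE_X}
for name, pct in percentages.items():
    print(f"{name:<14}{pct:5.1f}%")

# Width of the range between the best and worst subgroup averages
spread = round(max(percentages.values()) - min(percentages.values()), 1)
print(f"Spread between subgroups: {spread} points")
```

On these figures the gap between the highest and lowest subgroup average is only about three percentage points, which is the narrow range the table is meant to show.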
Another way of looking at such questions on a smaller scale might be to look at the top end only, Golden and Silver Ears, where we would expect recognition to be at its greatest. Conveniently, there are seven Golden Ears, six of them pros, and six Silver Ears, five of them amateurs. The 86%-pro Golden Ears said they recognized, on average, a remarkable 27% of all questions; the 83%-amateur Silver Ears claimed to have recognized 14% of all questions, half the Golden Ears' rate, but more than twice the rate of the other five-sixths of the group, which was 6%. Golden and Silver Ears are only a sixth of the whole group, but the big difference between the two top groups, one mostly pro, the other mostly amateur, is fully consistent with our commonsense guess that pros would recognize more passages than amateurs.

If so, we can roughly calculate the two top groups' net accuracy, as explained in Section IV above, simply by subtracting recognitions from both correct answers and total takes of each question, giving us net correct answers as a fraction of net takes. This cuts average Golden-Ear accuracy from 87% gross to 81% net, and Silver-Ear accuracy from 79% gross to 75% net. With recognized passages removed, both groups got fewer passages right in fewer questions, with the high-recognition Golden Ears, not surprisingly, losing more accuracy than the lower-recognition Silver Ears. Netting for recognition, in effect, reduces Golden Ears to Silver, and Silver to Bronze, with the top pros, on average, still retaining a 6% detection accuracy edge over the top amateurs, with 81% net accuracy, versus 75%. Their gross edge had been 8%, 87% versus 79%.

Suppose we sought an aggregated, majority-rule on each question group score for the Golden Ears only? It's not clear that they would score much higher than the rated group as a whole (82%), which, let us recall, was only three points higher than the whole group (79%). All 7 Golden Ears recognized four of the 28 passages.
For the remaining 24 passages, where there was some unrecognition to be had, there was a net correct majority on 20, a net incorrect majority on three, and a 50-50 tossup on one. If you put aside the tossup altogether, that would give the Golden Ears 20 right out of 23 (87%). If you give half-credit for the tossup, they get 20.5 right out of 24 (85%). If you count the tossup, but give no credit for it, they get 20 right out of 24 (83%). Any of these would be arguable, but we would consider the most conservative of them, 83%, the most defensible.

We would conclude from this that the Golden Ears were far ahead of all others in recognition of passages, and 6% better than the Silver Ears in net intuitive detections of unrecognized passages, but horizontal aggregation doesn't seem to boost group accuracy as much at the highest level as it does for the whole group. Unlike the Claremont students in the pilot study, the best individuals in the SHAKSPER group did better than the whole group aggregated, and better than the best of the group aggregated. Several of these individuals also did better than the best Claremont student, even after netting out recognized passages. Net group accuracy for Golden Ears could be five out of six (83%), but it would take a bit of indulgence to get it to six out of seven (86%). Top pros look better than top amateurs at recognizing passages, and it is probable that this is also so of all pros, though we haven't tested for it. We have very little evidence that any category of taker surpasses the others on average.

X. How the SHAKSPER group compares in accuracy with stylometric tests.

Here is what we said about the Claremont pilot study in 2004:

How does this accuracy compare with that of our best quantitative tests? "Far higher" would be a persuasive answer for such short, Sonnet-length samples.
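The Golden Ear arithmetic in Section IX can be sketched as follows. This is an illustration only, not the procedure actually used to score the test: the function names are invented for the purpose, and the individual taker in the first example is hypothetical. The netting rule is the one described for Section IV (recognized passages are subtracted from both correct answers and total takes), and the three tossup conventions use the counts given for the 24 passages the Golden Ears did not all recognize.

```python
def net_accuracy(correct, takes, recognized):
    """Net accuracy: drop recognized passages from both correct answers
    and total takes. Like the rule in the text, this treats every
    recognized passage as one that was answered correctly."""
    net_takes = takes - recognized
    if net_takes <= 0:
        raise ValueError("no unrecognized passages left to score")
    return (correct - recognized) / net_takes


def group_score(right, wrong, tossups, tossup_credit):
    """Majority-rule group accuracy under one tossup convention.

    tossup_credit=None sets tossups aside entirely; otherwise each
    tossup stays in the denominator and earns that much credit."""
    if tossup_credit is None:
        return right / (right + wrong)
    return (right + tossup_credit * tossups) / (right + wrong + tossups)


# Hypothetical taker (not from the data): 24 of 28 right, 8 recognized.
gross = 24 / 28
net = net_accuracy(correct=24, takes=28, recognized=8)
print(f"gross {gross:.1%}, net {net:.1%}")

# Golden Ears, aggregated: 20 net correct majorities, 3 incorrect, 1 tossup.
set_aside = round(100 * group_score(20, 3, 1, None))   # tossup excluded
half_credit = round(100 * group_score(20, 3, 1, 0.5))  # tossup worth half
no_credit = round(100 * group_score(20, 3, 1, 0))      # tossup counted, no credit
print(set_aside, half_credit, no_credit)
```

The three conventions reproduce the 87%, 85%, and 83% figures given above; the choice among them is a judgment call, and the most conservative one gives the lowest score.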
All of our quantitative tests are sensitive to sample length because longer samples average out more variance than shorter ones, giving us tighter ranges and higher discrimination for long samples than for short. Most of the samples we used in our Golden Ear test have no more than 150 words, far shorter than any for which we have dared to validate any of our quantitative tests. For comparison, our current estimated composite accuracy rates for longer, single-authored passages look something like this:

Text                       Shakespeare    Non-Shakespeare
Whole plays                100%           100%
Poems, 3000 words          100%           100%
Play Verse, 3000 words     95%            100%
Poems, 1500 words          100%           100%
Play Verse, 1500 words     96%            88%
Poems, 750 words           93%            71%
Play Verse, 750 words      97%            75%
Poems, 470 words           92%            73%

Not much has changed since then. Our accuracy figures remain the same, and the SHAKSPER Golden Ear outcomes are similar to those from our student pilot study, only slightly higher for the best individuals, but somewhat lower for the group. SHAKSPER's double-tweaked accuracy from intuition alone is not as good as our tests have been on samples of 1,500 words or more, but it's in the same ball park with our accuracy for passages of 470 to 750 words - and it is far higher than we would expect any or all of our tests to do on the very short passages tested, which averaged about 140 words. If Golden Ear had been tried as a stylometric test, it probably would not quite have met our test criteria - of around 95% reliability in saying "could be" to known Shakespeare, and at least 20% reliability in saying "couldn't be" to known non-Shakespeare - but that is because we rely on negative evidence and are far less tolerant of false negatives than of false positives. Without question, intuition is far better than all our tests combined on the sonnet-length passages we tested.
As we explained in Section V, it is conceivable that using longer passages would raise Golden-Ear accuracy, but doubtful that anyone could devise practicable intuitive tests for, say, the 1,500-word or 3,000-word passages for which we consider our stylometric tests to be well validated. If we, or anyone else, could find and offer 28 3,000-word passages for identification, the test would be equal in length to Hamlet, Macbeth, Romeo and Juliet, and The Comedy of Errors combined and would take more than a day just to read, let alone analyze, in its entirety. In general, the longer the passages, the fewer can be tested without expecting miracles of motivation. From this perspective, Golden-Ear testing may be almost as impractical for wholesale testing of long passages as computers are for testing short passages.

XI. Conclusions

In sum, after much tweaking and netting, SHAKSPERians as a group seem capable of getting almost four out of five, or five out of six, identifications right, and the very best individual did a bit better than that, with net accuracy reaching 87%. However, only two of 80 test takers got net accuracy higher than 83%. The average Golden Ear had 80% net accuracy, silver 75%; the average individual SHAKSPERian taker had 63% net accuracy.

SHAKSPER's overall performance roughly matched the best of our student pilot groups, with aggregated group accuracy slightly lower, and the accuracy of the very top individuals somewhat higher. SHAKSPER's recognition rates were much higher than those of the Claremont pilot group, and its Golden Ears' rates much higher than the rest of SHAKSPER. No one else has taken the test as seriously as SHAKSPER. None of the subcategories of takers stood out as much better or worse than the others, and the differences in gross accuracy between pros and amateurs seem remarkably small.
Intuition seems much more accurate than stylometrics for very short, sonnet-length passages; stylometrics seems more accurate than intuition for longer passages, but an actual head-to-head comparison of many long passages seems impracticable.

XII. Golden Ear Round 2

Golden Ear Round 1 has given us what looks like a highly talented panel of a dozen rated, identified SHAKSPERians, to take Golden Ear Round 2. We haven't yet asked them whether any of them want their names or their particulars, pro/amateur, writer, player, etc., disclosed to SHAKSPER or anyone else. To these we might add up to eight or nine rated players discovered in previous tests, if we can find them and get them to serve. A couple of unrated SHAKSPERians want to take Round 2, and we would not grudge them the experience, nor would we be above retroactively including some or all of the dozen rated players who did not identify themselves on the test, or, for that matter, other unrated players who want another shot, but we need enough identification on Round 2 to e-mail it to the proper recipients and to relate their Round 2 outcomes to their Round 1 outcomes. Unfortunately, we don't have Round 2 on the web and shall have to find a way to mail or e-mail it to takers to do by hand, and to be scored by hand.

Like Round 1, Round 2 will have a known, scorable component, both to help calibrate the test, which we would guess is more difficult than Round 1, and to provide more data points to help see how much, if any, of the high accuracy rates found at the top of Round 1 was luck of the draw. It's one thing to find and congratulate the best guesser of the number of beans in the Business School jar; it is quite another to expect him to do it twice. Round 2 will also have some of its own shots in the dark, passages whose authorship is not settled.

XIII.
A last note on methodology

In some ways, it is astonishing, given the frequency and fervency of declarations that intuition can outperform stylometry - or that stylometry can outperform "sniff tests" -- that no one has ever tried to see whether, and to what extent, either proposition is actually so. We have been trying to remedy that for twelve years and have now, at last, thanks to the help of our student programmer, Ryan Wilson, and his advisor, Arthur Lee, gotten a Round 1 survey up on the net and gotten an excellent response from SHAKSPER, which has permitted us, at last, to try a first cut at an answer.

We have used the best methodology we could manage, but who are we to proclaim that the tradeoffs we chose in our first big outing are the ones that should bind all others? For an exercise as novel as this, it would be surprising if further experimentation with different parameters did not produce new insights perhaps wiser and more penetrating than ours, and informed by our mistakes, as well as by our successes. Every adventure is a reconnaissance for the next, and this seems to us a question begging to be explored from more than one perspective. As always, if anyone in or out of SHAKSPER would like to try an experiment with different tradeoffs than the ones we chose, and SHAKSPERians were willing to take it, we would be happy to help them out. In the meantime, we consider our tradeoffs reasonable ones and our evidence the best currently available. We hope it will inspire better.

In the meantime, we would like, again, to thank the SHAKSPERians and others who took our survey for giving it their full attention, and especially for honoring our request to withhold their online comments till the test was over, so as not to wreck the test for others.
We now welcome comments, online and off, but we hope the online ones will take care not to give away too many specifics of the test, which cost twelve years of pretesting and hundreds of dollars for programming to prepare, and, now, many hours from SHAKSPERians to take and us to analyze. We would just as soon keep it available for future use with different groups, and as a standard against which other versions can be measured. We hope that SHAKSPERians will help us keep it so, as much as possible.

welliott@mcknenna.edu (the address, not the professor) was retired July 1, 2007; please use welliott@cmc.edu instead.

Ward Elliott
Burnet C. Wohlford Professor of American Political Institutions
Claremont McKenna College
Pitzer Hall, 850 Columbia Ave.
Claremont, CA 91711-6420
(909) 607-3649
Fax (909) 621-8419
welliott@cmc.edu
http://govt.cmc.edu/welliott

"Better grey words with crimson examples than crimson words with grey examples."

_______________________________________________________________
S H A K S P E R: The Global Shakespeare Discussion List
Hardy M. Cook, editor@shaksper.net
The S H A K S P E R Web Site <http://www.shaksper.net>

DISCLAIMER: Although SHAKSPER is a moderated discussion list, the opinions expressed on it are the sole property of the poster, and the editor assumes no responsibility for them.