Sadly, it’s not particularly surprising that it took a proclamation by researchers from prominent institutions (Harvard and MIT) to draw the media’s attention to what should have been obvious all along. That they don’t have alternative metrics handy highlights the difficulties of assessment in the absence of high-quality data both inside and outside the system. Inside the system, designers of online courses are still figuring out how to assess knowledge and learning quickly and effectively. Outside the system, would-be analysts lack information on how students (graduates and drop-outs alike) make use of what they learned, or fail to. Measuring long-term retention and far transfer will continue to pose a problem for evaluating educational experiences as they become more modularized and unbundled, unless systems emerge for integrating outcome data across experiences and over time. In economic terms, this exemplifies the need to internalize the externalities to the system.
One metric for evaluating automated scoring is to compare it against human scoring. For some domains and test formats (e.g., multiple-choice items on factual knowledge), automation has an accepted advantage in objectivity and reliability, although whether such questions assess meaningful understanding is often debated. With more open-ended domains and designs, human reading is typically considered superior, allowing room for individual nuance to shine through and get recognized.
Yet this exposé of some professional scorers’ experience reveals how even that cherished human judgment can get distorted and devalued. Here, narrow rubrics, mandated consistency, and expectations of bell curves valued sameness over subtlety and efficiency over reflection. In essence, such simplistic algorithms resulted in reverse-engineering cookie-cutter essays that all had to fit one of their six categories, differing details be damned.
Individual algorithms and procedures for assessing tests need to be improved so that they can make better use of a broader base of information. So does a system that relies so heavily on particular assessments that the impact of their weaknesses is greatly magnified. Teachers and schools collect a wealth of assessment data all the time; better mechanisms for aggregating and analyzing these data can extract more informational value from them and decrease the disproportionate weight on testing factories. When designed well, algorithms and automated tools for assessment can enhance human judgment rather than reducing it to an arbitrary bin-sorting exercise.
Some historical context on how standardized tests have affected the elite shows how gatekeepers can magnify the influence of certain factors over others, whether through chance or through bias:
In 1947, the three significant testing organizations, the College Entrance Examination Board, the Carnegie Foundation for the Advancement of Teaching and the American Council on Education, merged their testing divisions into the Educational Testing Service, which was headed by former Harvard Dean Henry Chauncey.
Chauncey was greatly affected by a 1948 Scientific Monthly article, “The Measurement of Mental Systems (Can Intelligence Be Measured?)” by W. Allison Davis and Robert J. Havighurst, which called intelligence tests nothing more than a scientific way to give preference to children from middle- and upper-middle-class families. The article challenged Chauncey’s belief that by expanding standardized tests of mental ability and knowledge America’s colleges would become the vanguard of a new meritocracy of intellect, ability and ambition, and not finishing schools for the privileged.
The authors, and others, challenged that the tests were biased. Challenges aside, the proponents of widespread standardized testing were instrumental in the process of who crossed the American economic divide, as college graduates became the country’s economic winners in the postwar era.
As Nicholas Lemann wrote in his book “The Big Test,” “The machinery that (Harvard President James) Conant and Chauncey and their allies created is today so familiar and all-encompassing that it seems almost like a natural phenomenon, or at least an organism that evolved spontaneously in response to conditions. … It’s not.”
As a New Mexico elementary teacher and blogger explains:
My point is that test scores have a lot of IMPACT because of the graduation requirements, even if they don’t always have a lot of VALUE as a measure of growth.
Instead of grade inflation, we have testing-influence inflation, where the impact of certain tests is magnified beyond that of other assessment metrics. It becomes a kind of market distortion in the economics of test scores, where some measurements are more visible and assume more value than others, inviting cheating and “gaming the system”.
We can restore openness and transparency to the system by collecting continuous assessment data that assign more equal weight across a wider range of testing experiences, removing incentives to cheat or “teach to the test”. Adaptive and personalized assessment goes further in alleviating pressures to cheat, by reducing the inflated number of competitors against whom one may be compared. Assessment can then return to fulfilling its intended role of providing useful information on what a student has learned, thereby yielding better measures of growth and becoming more honestly meritocratic.
From Peter Nonacs, UCLA professor teaching Behavioral Ecology:
Tests are really just measures of how the Education Game is proceeding. Professors test to measure their success at teaching, and students take tests in order to get a good grade. Might these goals be maximized simultaneously? What if I let the students write their own rules for the test-taking game? Allow them to do everything we would normally call cheating?
And in a new MOOC titled “Understanding Cheating in Online Courses,” taught by Bernard Bull at Concordia University Wisconsin:
The start of the course will cover the basic vocabulary and different types of cheating. The course will then move into discussing the differences between online and face-to-face learning, and the philosophy and psychology behind academic integrity. One unit will examine the best practices to minimize cheating.
Cheating crops up whenever there is a mismatch between effort and reward, something which happens often in our current educational system. Assigning unequal rewards to equal efforts biases attention toward the inflated reward, motivating cheating. Assigning equal rewards to unequal efforts favors the lesser effort, enabling cheating. The greater the disparities, the greater the likelihood of cheating.
Thus, one potential avenue for reducing cheating would be to better align the reward to the effort, to link the evaluation of outputs more closely to the actual inputs. High-stakes tests separate them by exaggerating the influence of a single, limited snapshot. In contrast, continuous, passive assessment brings them closer by examining a much broader range of work over time, collected in authentic learning contexts rather than artificial testing situations. Education then becomes a series of honest learning experiences, rather than an arbitrary system to game.
In an era where students learn what gets assessed, the answer may be to assess everything.
What is university for? I ask this old question because the utilitarian answer which was especially popular in the New Labour years – that the economy needs more graduates – might be becoming less plausible. A new paper by Paul Beaudry and colleagues says (pdf) there has been a “great reversal” in the demand for high cognitive skills in the US since around 2000, and the BLS forecasts that the fastest-growing occupations between now and 2020 will be mostly traditionally non-graduate ones, such as care assistants, fast food workers and truck drivers; Allister Heath thinks a similar thing might be true for the UK.
Nevertheless, we should ask: what function would universities serve in an economy where demand for higher cognitive skills is declining? There are many possibilities:
– Network effects. University teaches you to associate with the sort of people who might have good jobs in future, and might give you the contacts to get such jobs later.
– A lottery ticket. A degree doesn’t guarantee getting a good job. But without one, you have no chance.
– Flexibility. A graduate can stack shelves, and might be more attractive as a shelf-stacker than a non-graduate. Beaudry and colleagues describe how the falling demand for graduates has caused graduates to displace non-graduates in less skilled jobs.
– Maturation & hidden unemployment. 21-year-olds are more employable than 18-year-olds, simply because they are three years less foolish. In this sense, university lets people pass time without showing up in the unemployment data.
– Consumption benefits. University is a less unpleasant way of spending three years than work. And it can provide a stock of consumption capital which improves the quality of our future leisure. By far the most important thing I learnt at Oxford was a love of Hank Williams and Leonard Cohen.
As the signaling function of the degree falls, we should consider how the signaling power of certificates, competencies, and other innovations may rise to overtake it. With specific knowledge and skills unbundled from each other, these markers may be more responsive to actual demand. More specific assessment metrics can help stakeholders better evaluate different programs of study, while more flexible learning paths can help students more efficiently pursue the knowledge and skills that will be most valuable to them.
Criticisms of high-stakes tests abound as we usher in the start of K-12 testing season. Students worry about being judged on a bad day and note that tests measure only one kind of success, while teachers lament the narrowing of the curriculum. Others object to the lack of transparency in a system entrusted with such great influence.
Yet the problem isn’t tests themselves, but relying on only a few tests. What we actually need is more information, not less. Ongoing assessment collected from multiple opportunities, in varied contexts, and across time can help shield any one datapoint from receiving undue weight.
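To make the arithmetic behind this concrete, here is a minimal sketch of how much a single bad day moves a student's overall result under two weighting schemes. Everything in it is hypothetical and chosen only for illustration: the scores, the 70% weight given to the one high-stakes test, and the helper name `weighted_score`. It is not drawn from any real grading policy.

```python
def weighted_score(scores, weights):
    """Weighted average of scores; weights need not sum to 1."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical student: steady work all year, then one bad test day.
coursework = [85, 88, 90, 87, 86, 89, 91, 88, 87]  # nine ongoing assessments
bad_day = 55                                        # the single off-day test
scores = coursework + [bad_day]

# Assumed high-stakes scheme: the one test carries 70% of the result,
# and the nine coursework items share the remaining 30%.
high_stakes_weights = [30 / 9] * 9 + [70]

# Assumed continuous scheme: all ten assessments count equally.
equal_weights = [1] * 10

high_stakes = weighted_score(scores, high_stakes_weights)
continuous = weighted_score(scores, equal_weights)

print(f"High-stakes weighting: {high_stakes:.1f}")  # about 64.9
print(f"Equal weighting:       {continuous:.1f}")   # 84.6
```

Under these assumed numbers, the same off day costs the student roughly 23 points when one test dominates, but only about 3 points when it is one datapoint among ten. Adding more assessments shrinks the share of any single one further.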
Personalized assessment goes further in acknowledging the difference between standardization in measurement (valuable) and uniformity in testing (unhelpful). Students with different goals deserve to be assessed by different standards and methods, and not arbitrarily pitted against each other in universal comparisons. Gathering more data from richer contexts that are better matched to students’ learning needs is a fundamental tenet of personalization.