On standardization and personalization in education

At the heart of developing personalized learning systems are critical design and policy questions about what to personalize and what to standardize. This working paper describes key dimensions to consider in finding the right balance between consistency and flexibility.

An economic argument for personalized learning

As described on The Economist’s “Schumpeter” blog, economic growth depends on innovation and more flexible job preparation:

Entrepreneurs repeatedly complain that they cannot hire the right people because universities are failing to keep pace with a fast-changing job market. Small firms lack the resources to provide training and are consequently making do with fewer people working longer hours.

The claim that “established firms are usually in the business of preserving the old world” bears interesting parallels to universities, which tend to replicate their own status quo. Instead of producing numerous graduates of the programs of yesteryear, universities need to update their training to develop knowledge and skills in current demand. Adapting to these demands through lengthy committee review and accreditation requirements is unlikely to be fast enough for the “agile-development” expectations of today’s startup culture. Educational institutions thus need new processes for tailoring programs of study to modern demands with both integrity and efficiency.

An even stronger motivation for allowing students to tailor their own course of study to their particular needs is that employers seek teams of people with a mix of complementary skills, not multiple copies of people with the same skill set. Instead of trying to differentiate candidates on some imagined basis of unidimensional merit, employers need to differentiate them along multiple dimensions of value to their particular needs. Employers are constantly talking about “fit”; educational institutions should facilitate discovery of a good fit by using personalized assessment, to provide richer information about how a candidate’s unique strengths and experiences may match a particular profile of needs.

Standardized tests as market distortions

Some historical context on how standardized tests have affected the elite points out how gatekeepers can magnify the influence of certain factors over others, whether through chance or through bias:

In 1947, the three significant testing organizations, the College Entrance Examination Board, the Carnegie Foundation for the Advancement of Teaching and the American Council on Education, merged their testing divisions into the Educational Testing Service, which was headed by former Harvard Dean Henry Chauncey.

Chauncey was greatly affected by a 1948 Scientific Monthly article, “The Measurement of Mental Systems (Can Intelligence Be Measured?)” by W. Allison Davis and Robert J. Havighurst, which called intelligence tests nothing more than a scientific way to give preference to children from middle- and upper-middle-class families. The article challenged Chauncey’s belief that by expanding standardized tests of mental ability and knowledge America’s colleges would become the vanguard of a new meritocracy of intellect, ability and ambition, and not finishing schools for the privileged.

The authors, and others, challenged that the tests were biased. Challenges aside, the proponents of widespread standardized testing were instrumental in the process of who crossed the American economic divide, as college graduates became the country’s economic winners in the postwar era.

As Nicholas Lemann wrote in his book “The Big Test,” “The machinery that (Harvard President James) Conant and Chauncey and their allies created is today so familiar and all-encompassing that it seems almost like a natural phenomenon, or at least an organism that evolved spontaneously in response to conditions. … It’s not.”

As a New Mexico elementary teacher and blogger explains:

My point is that test scores have a lot of IMPACT because of the graduation requirements, even if they don’t always have a lot of VALUE as a measure of growth.

Instead of grade inflation, we have testing-influence inflation, where the impact of certain tests is magnified beyond that of other assessment metrics. It becomes a kind of market distortion in the economics of test scores, where some measurements are more visible and assume more value than others, inviting cheating and “gaming the system”.

We can restore openness and transparency to the system by collecting continuous assessment data that assign more equal weight across a wider range of testing experiences, removing incentives to cheat or “teach to the test”. Adaptive and personalized assessment go further in alleviating pressures to cheat, by reducing the inflated number of competitors against whom one may be compared. Assessment can then return to fulfilling its intended role of providing useful information on what a student has learned, thereby yielding better measures of growth and becoming more honestly meritocratic.

Unpacking degrees

Chris Dillow questions the purpose and value of a university degree (linked from Observational Epidemiology):

What is university for? I ask this old question because the utilitarian answer which was especially popular in the New Labour years – that the economy needs more graduates – might be becoming less plausible. A new paper by Paul Beaudry and colleagues says (pdf) there has been a “great reversal” in the demand for high cognitive skills in the US since around 2000, and the BLS forecasts that the fastest-growing occupations between now and 2020 will be mostly traditionally non-graduate ones, such as care assistants, fast food workers and truck drivers; Allister Heath thinks a similar thing might be true for the UK.

Nevertheless, we should ask: what function would universities serve in an economy where demand for higher cognitive skills is declining? There are many possibilities:

– A signaling device. A degree tells prospective employers that its holder is intelligent, hard-working and moderately conventional – all attractive qualities.

– Network effects. University teaches you to associate with the sort of people who might have good jobs in future, and might give you the contacts to get such jobs later.

– A lottery ticket. A degree doesn’t guarantee getting a good job. But without one, you have no chance.

– Flexibility. A graduate can stack shelves, and might be more attractive as a shelf-stacker than a non-graduate. Beaudry and colleagues describe how the falling demand for graduates has caused graduates to displace non-graduates in less skilled jobs.

– Maturation & hidden unemployment. 21-year-olds are more employable than 18-year-olds, simply because they are three years less foolish. In this sense, university lets people pass time without showing up in the unemployment data.

– Consumption benefits. University is a less unpleasant way of spending three years than work. And it can provide a stock of consumption capital which improves the quality of our future leisure. By far the most important thing I learnt at Oxford was a love of Hank Williams and Leonard Cohen.

As the signaling function of the degree falls, we should consider how the signaling power of certificates, competencies, and other innovations may rise to overtake it. With specific knowledge and skills unbundled from each other, these markers may be more responsive to actual demand. More specific assessment metrics can help stakeholders better evaluate different programs of study, while more flexible learning paths can help students more efficiently pursue the knowledge and skills that will be most valuable to them.

Using personalized assessment to change the high-stakes testing culture

Criticisms of high-stakes tests abound as we usher in the start of K-12 testing season. Students worry about being judged on a bad day and note that tests measure only one kind of success, while teachers lament the narrowing of the curriculum. Others object to the lack of transparency in a system entrusted with such great influence.

Yet the problem isn’t tests themselves, but relying on only a few tests. What we actually need is more information, not less. Ongoing assessment collected from multiple opportunities, in varied contexts, and across time can help shield any one datapoint from receiving undue weight.

Personalized assessment goes further in acknowledging the difference between standardization in measurement (valuable) and uniformity in testing (unhelpful). Students with different goals deserve to be assessed by different standards and methods, and not arbitrarily pitted against each other in universal comparisons. Gathering more data from richer contexts that are better matched to students’ learning needs is a fundamental tenet of personalization.

Freedom and guidance in competency-based education

According to Paul Fain:

competency-based education… looks nothing like traditional college classes. Perhaps the method’s most revolutionary, and controversial, contribution is a changed role for faculty. Instructors don’t teach, because there are no lectures or any other guided path through course material.

Aside from the narrow view of what constitutes “teaching”, this paints only one version of what competency-based education might look like. Competencies refer to the milestones by which stakeholders assess progress, thus constraining the entry and endpoints but not the paths by which those milestones might be reached. Students could all traverse the same path but at their own pace, or they might follow any of a finite set of well-defined trajectories prescribed by instructional designers. They could also be free to chart their own course through open terrain, whether advised by a personal guide or a generic tour book, perhaps even with prerecorded audio or video highlighting landmarks. Recommended or mandated paths can then be tailored to students’ needs, experiences, and preferences. The extra degrees of freedom mean that competency-based education actually has the potential to enable much more personalized guidance than traditional time-based formats.

What should we assess?

Some thoughts on what tests should measure, from Justin Minkel:

Harvard education scholar Tony Wagner was quoted in a recent op-ed piece by Thomas Friedman on what we should be measuring instead: “Because knowledge is available on every Internet-connected device, what you know matters far less than what you can do with what you know. The capacity to innovate—the ability to solve problems creatively or bring new possibilities to life—and skills like critical thinking, communication and collaboration are far more important than academic knowledge.”

Can we measure these things that matter? I think we can. It’s harder to measure critical thinking and innovation than it is to measure basic skills. Harder but not impossible.

His suggestions:

For starters, we need to make sure that tests students take meets [sic] three basic criteria:

1. They must measure individual student growth.

2. Questions must be differentiated, so the test captures what students below and above grade-level know and still need to learn.

3. The tests must measures [sic] what matters: critical thinking, ingenuity, collaboration, and real-world problem-solving.

Measuring individual growth and providing differentiated questions are obvious design goals for personalized assessment. The third remains a challenge for assessment design all around.

Automating assessment: How, when, and why?

EdX, the most prominent nonprofit MOOC provider, plans to use and share automated software to grade and give feedback on student essays. On the heels of this announcement have come legitimate skepticism about how well computers can actually grade student work (i.e., “Can this be done?”) and understandable concern about whether this is a worthwhile direction for education to take (i.e., “Should this be done?”). Recasting these two questions in terms of how, when, and why to apply automated assessment yields a more critical framework for finding the right balance between machine-intelligent and human-intelligent assessment.

How?

When evaluating the limitations of artificial intelligence, I find it helpful to ask whether they can be classified as issues with data or algorithms. In some cases, available data simply weren’t included in the model, while in others, such data may be prohibitively difficult or expensive to capture. The algorithms contain the details of how data get transformed into predictions and recommendations. They codify what gets weighted more heavily, which factors are assumed to influence each other, and how much.

Limitations of data: Train on a broader set of sample student work

Todd Pettigrew describes some familiar examples of how student work might appear to merit one grade on the surface but another for content:

it is quite common to see essays that are superficially strong — good grammar, rich vocabulary — but lack any real insight… Similarly some very strong essays—with striking originality and deep insight—have a surprising number of technical errors that would likely lead a computer algorithm to conclude it was bad.

This highlights the need to train the model on such edge cases so that it can distinguish style from substance, and to ensure that it does not false-alarm on spurious features. Elijah Mayfield points out that a training set of only 100 hand-graded essays is inadequate; the style-versus-substance distinction is just one kind of information such a small sample could fail to capture adequately.
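
To make this concrete, here is a minimal sketch in Python (assuming scikit-learn; the feature set, placeholder essays, and scores are purely illustrative, not any vendor’s model) of the kind of surface-feature grader this criticism targets. A model trained only on length and vocabulary statistics has no representation of content, so a polished but vacuous essay and an insightful but error-ridden one can look identical to it.

```python
# Illustrative sketch only: a grader trained purely on surface features
# rewards length and vocabulary richness regardless of content.
import numpy as np
from sklearn.linear_model import Ridge

def surface_features(essay: str) -> list:
    """Surface-level statistics only; nothing here models meaning."""
    words = essay.split()
    n_words = len(words)
    type_token_ratio = len(set(words)) / n_words if n_words else 0.0
    avg_word_length = float(np.mean([len(w) for w in words])) if words else 0.0
    return [n_words, type_token_ratio, avg_word_length]

# Placeholder hand-graded data; with only ~100 real essays, error estimates
# for any model trained this way would themselves be very noisy.
essays = ["placeholder graded essay text ...", "another graded essay ..."]
grades = [3.0, 4.5]

X = np.array([surface_features(e) for e in essays])
y = np.array(grades)
model = Ridge(alpha=1.0).fit(X, y)

# A superficially strong essay with no real insight and a typo-ridden essay
# with deep insight can receive nearly identical feature vectors here,
# and therefore nearly identical predicted grades.
```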

Limitations of data: Include data from beyond the assignment and the course

Another relevant concern is whether the essay simply paraphrased an idea from another source, or if it included an original contribution. Again from Todd Pettigrew:

the computer cannot possibly know how the students’ answers have related to what was done elsewhere in the course. Did a student’s answer present an original idea? Or did it just rehash what the prof said in class?

Including other information presented in the course would allow the model to recognize low-level rehashing; adding information from external sources could help situate the essay’s ideas relative to other ideas. A compendium of previously-expressed ideas could also be labeled as normative (consistent with the target concepts to be learned) or non-normative (such as common misconceptions), to better approximate the distance between the “new” idea and “old-but-useful” ideas or “old-but-not-so-useful” ideas. But confirming whether that potentially new idea is a worthwhile insight, a personal digression, or a flawed claim is probably still best left to the human expert, until we have better models for evaluating innovation.
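
As one hedged sketch of how such a comparison might work (the corpora, labels, similarity cutoff, and function name below are all placeholders, not values from any deployed system), an essay can be compared against lecture materials to flag likely rehashing, and against a labeled bank of previously expressed ideas to see whether its nearest neighbor is normative or non-normative:

```python
# Sketch with hypothetical corpora and an illustrative 0.8 cutoff: situate an
# essay relative to course material and a labeled bank of prior ideas.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lecture_notes = ["text of each lecture or assigned reading ..."]  # placeholder
prior_ideas = ["a previously expressed idea ..."]                 # placeholder
prior_labels = ["normative"]               # or "non-normative" (misconception)

def situate_essay(essay: str) -> dict:
    docs = [essay] + lecture_notes + prior_ideas
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    essay_vec = tfidf[0]
    lecture_sims = cosine_similarity(essay_vec, tfidf[1:1 + len(lecture_notes)])[0]
    idea_sims = cosine_similarity(essay_vec, tfidf[1 + len(lecture_notes):])[0]
    nearest = int(idea_sims.argmax())
    return {
        "likely_rehash_of_lecture": bool(lecture_sims.max() > 0.8),  # illustrative
        "nearest_prior_idea_label": prior_labels[nearest],
        "novelty_score": float(1.0 - max(lecture_sims.max(), idea_sims.max())),
    }

# Whether a genuinely novel idea is an insight, a digression, or a flawed
# claim is still left to the human expert, as argued above.
```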

Limitations of data: Optimizing along the wrong output parameters

Scores that were dashed off by harried, overworked graders provide a poor standard for training an AI system. More fundamentally, the essay grade itself is not the goal; it is only a proxy for what we believe the goals of education should be. Robust assessment relies on multiple measures collected over time, across contexts, and corroborated by different raters. If we value long-term retention, transfer, and future learning potential, then our assessment metrics and models should include those.

I recognize that researchers are simply using the data that are most readily available and that have the most face value. My own work sought to predict end-of-course grades as a preliminary proof of concept because that’s the information we consistently have and use, and our society (perhaps grudgingly) accepts that. Ideally, I would prefer different assessment data. In pointing the direction in which we ought to go with such innovations, ultimately we should identify better data (through educators, assessment experts, and learning scientists), make them readily available (through policymakers and data architects), and demand their incorporation in the algorithms and tools used (through data analysts, machine learning specialists, and developers).

Limitations of algorithms: Model for meaning

Predictive or not, features such as essay length, sophistication of vocabulary, sentence complexity, and use of punctuation are typically not the most critical determinants of essay quality. What we care about is content, which demands modeling the conceptual domain. Hierarchical topic models can map the relative conceptual sophistication of an essay, tracking the depth and novelty of a student’s writing. While a simple semantic “bag-of-words” model ignores word order and proximity, a purely syntactic model accepts grammatical gibberish. A combined semantic-syntactic model can capture not just word co-occurrence patterns, but higher-order relations between words and concepts, as evident in sentence and document structure. Compared to the approaches earning such public rebuke now, more sophisticated algorithms exist, although they need more testing on better data.
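
A rough sketch of this combined approach follows (Python with scikit-learn; the component choices and parameters are illustrative, and latent semantic analysis plus word-order n-grams stand in for the richer hierarchical topic models and parsers a research system would use). The point is simply that content features and structural features can be modeled jointly rather than relying on surface statistics alone.

```python
# Sketch of a combined semantic-syntactic feature pipeline; parameters are
# illustrative and the hand-graded training data are assumed, not provided.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

# "Semantic" side: which words co-occur, compressed into latent concepts (LSA).
semantic = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("topics", TruncatedSVD(n_components=50)),  # assumes a vocabulary > 50 terms
])

# Crude "syntactic" side: local word order via bigrams and trigrams.
syntactic = CountVectorizer(ngram_range=(2, 3))

grader = Pipeline([
    ("features", FeatureUnion([("semantic", semantic), ("syntactic", syntactic)])),
    ("score", Ridge(alpha=1.0)),
])

# grader.fit(training_essays, human_scores)   # hypothetical hand-graded corpus
# predicted = grader.predict(new_essays)
```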

When?

The question here is which parts of the assessment process are best kept “personalized,” and which parts are best made “adaptive.”

Rapid, automated feedback is useful only if the information can actually be used productively before the manual feedback would have arrived. For a student whose self-assessment is wide of the mark, an immediate grade can offer reassurance or a kick in the pants. For others, it may enable doing just enough to get by. Idealist instructors might shudder at the notion, but students juggling competing demands on their time might welcome the guidance. How well students can make sense of the feedback will depend on its specificity, understandability, actionability, and perceived cost-benefit calculus, all open questions in need of further iteration.

For instructors, rapid feedback can provide a snapshot of aggregate patterns that might otherwise take them hours, days, or longer to develop. Beyond simply highlighting averages which an expert instructor could already have predicted, such snapshots could cluster similar essays that should be read together to ensure consistency of grading. They could flag unusual ideas in individual essays or unexpected patterns across multiple essays for closer attention. Student work could be aggregated for pattern analysis within an individual class, across the history of each student, or across multiple instances of the same class over time. Some forms of contextualization may be overwhelming or even undesirably biasing, while others can promote greater fairness and enable deeper analysis. Determining which information is most worthwhile for an instructor to know during the grading process is thus another important open question.
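
As a rough illustration of what such a snapshot might involve (Python with scikit-learn; the function name, cluster count, and "three most atypical essays" cutoff are all illustrative), submissions can be grouped by textual similarity so that similar essays are graded together, with the essays farthest from any group surfaced for closer attention:

```python
# Illustrative sketch: cluster submissions for consistent grading and flag
# the most atypical essays for the instructor's early attention.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def grading_snapshot(essays, n_groups=5):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(essays)
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit(tfidf)
    # Distance from each essay to its own cluster center; large distances
    # suggest unusual ideas worth reading sooner rather than later.
    distances = km.transform(tfidf)[np.arange(len(essays)), km.labels_]
    unusual = np.argsort(distances)[-3:].tolist()    # illustrative cutoff
    groups = {g: np.where(km.labels_ == g)[0].tolist() for g in range(n_groups)}
    return groups, unusual
```

The same grouping could in principle be run across a single student’s history or across multiple offerings of the course, which is where the contextualization questions above become pressing.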

Both cases explicitly acknowledge the role of the person (either the student or the instructor) in considering how they may interact with the information given to them. An adaptive system would provide first-pass feedback for each user to integrate, but the instructor would still retain responsibility for evaluating student work and developing more sophisticated feedback on it, with both instructor and student continuing the conversation from there; that is the personalized component.

Essay-grading itself may not be the best application of this technology. It may be more aptly framed as a late-stage writing coach for the student, or an early-stage grading assistant for the instructor. It may be more useful when applied to a larger body of a given student’s work than to an individual assignment. Or it may be more effective for both student and instructor when applied to an online discussion, outlining emerging trends and concerns, highlighting glaring gaps, helping hasty writers revise before submitting, and alerting facilitators when and where to intervene. While scaling up assessment is an acknowledged “pain point” throughout the educational enterprise, automation may fulfill only some of those needs, with other innovations taking over the rest.

Why?

The purpose of technology should be to augment the human experience, not to replace or shortchange it, and education, especially writing, is fundamentally about connecting to other people. Many of the objections to automated essay grading reflect these beliefs, even if not explicitly stated. People question whether automation can capture something which goes far beyond that essay alone: not just the student’s longer learning trajectory, but the sense of a conversation between two people that extends over time, the participation in a meaningful interpersonal relationship. Whether our current instantiation of higher education meets this ideal is not the point. Rather, in a world where we can design technology to meet goals of our own choosing, and in which good design is a time-consuming and labor-intensive process, we should align those expensive technologies with worthwhile goals.

By these standards, any assessment of writing which robs the student and author of these extended conversations fundamentally fails. Jane Robbins claims that students need “the guidance of experts with depth and breadth in the field at hand”, a teacher who can also be “mentor, coach, prodder, supervisor.” As the students on the Brown Daily Herald’s editorial board argue:

an evaluation of an essay by a professor is just as important, if not more, to a student’s scholarship and writing. The ability to sit down and discuss the particularities of an essay with another well-informed and logical human is an essential part of the essay writing experience.

Coupled with arguments that other types of (machine-graded) assessment are better suited to evaluating content knowledge or even low-level critical thinking, these objections raise the question: Why try to automate the assessment of writing at all? After all, much of what I have advocated here is simply accelerating the assessment process, not truly automating it.

The most basic reason is simply that writing instruction is important, and students need ongoing practice and feedback to continue improving. To the extent that any assessment feedback can be effectively automated, it can help support this goal. That raises two additional questions: How deep must feedback be for a writing exercise to be worthwhile? And, more controversially, can writing that never sees a real audience still serve a legitimate pedagogical purpose?

Considering the benefits not just of actively retrieving and generating information, but also of organizing one’s thoughts into coherent expression, I would argue that some writing exercises can facilitate learning even without an audience. Less clear is how far that “some” stretches, or which specific conditions demand feedback from an expert human. Likely factors include more complex assignments, more extreme work quality, weaker feelings of student belonging, longer intervals between receiving human feedback, less sophisticated automated feedback, and more nuanced expert feedback. Better articulating these limits, and anticipating what we stand to gain and lose, will help guide future development and application of automated assessment.

Personalized and adaptive assessment: Placing the stakes

Compared to personalized and adaptive learning, the distinction between personalized and adaptive assessment is less contested, but perhaps also less widely discussed. As before, my definition will hinge upon the role of human decisionmaking in distinguishing between adaptive (machine-intelligent) and personalized (machine- and human-intelligent) assessment.

Most understand adaptive assessment in the context of computer-based adaptive testing, which adapts the parameters of the test to student performance in real time. Acing a series of questions may earn more challenging questions, while fumbling on several may elicit easier questions or increased assistance.
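
The underlying logic can be sketched with a simple staircase rule (a toy example, not any testing vendor’s algorithm; the five-point difficulty scale and two-in-a-row threshold are purely illustrative): difficulty steps up after consecutive correct answers and steps down, where easier items or added assistance would be offered, after a miss.

```python
# Toy staircase sketch of adaptive item selection; scale and thresholds
# are illustrative, not drawn from any real adaptive testing system.
from dataclasses import dataclass

@dataclass
class AdaptiveTest:
    difficulty: int = 3   # current level on an illustrative 1-5 scale
    streak: int = 0       # consecutive correct answers at this level

    def next_difficulty(self, answered_correctly: bool) -> int:
        if answered_correctly:
            self.streak += 1
            if self.streak >= 2:                 # two in a row: step up
                self.difficulty = min(5, self.difficulty + 1)
                self.streak = 0
        else:                                    # miss: step down
            self.streak = 0
            self.difficulty = max(1, self.difficulty - 1)
        return self.difficulty

# test = AdaptiveTest()
# test.next_difficulty(True); test.next_difficulty(True)   # harder items follow
# test.next_difficulty(False)                              # easier item or more help
```

Operational adaptive tests typically estimate ability with item response theory rather than a fixed staircase, but the feedback loop is the same in spirit.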

In a slightly different perspective on adaptive assessment, the BLE Group suggests that formative and summative (benchmark) assessments “measure whether an academic standard has been learned,” while adaptive assessments “measure growth and identify where the student is on the learning-ladder continuum,” and diagnostic assessments “determine missing skills and remediate them.” I see these as overlapping rather than distinct categories. Adaptive testing is already widespread in summative assessment. Further, questions may be adapted to discriminate between students or to measure mastery of key concepts and skills at any point along an expected learning trajectory, whether a prerequisite or endpoint.

How personalization goes beyond adaptivity in assessment is in explicitly incorporating the decisions of the persons involved in the process, the primary stakeholders being the learner, the instructor-grader, and the external audience interpreting the grade. I will focus on just these three roles as the simplest case, although other configurations may include separate roles for instructor and grader, peer grading, and group assessment.

Learners differ in the goals they have for their education, both in what they hope to learn and in what they will be expected to present as documentation of that learning. Some careers require a particular degree or certificate, while others may solicit work portfolios or barrage candidates with tricky interview questions. Some students seek a general liberal-arts education rather than job-specific training, while others may simply want to broaden their knowledge without regard for the mark received for it. The initial choices are left up to the learner, and the information sought is determined by the audience for the assessment (i.e., the employer, certifying organization, or society). Thus, tailoring what gets assessed and how results are presented around those expectations would entail personalization rather than adaptivity.

How to present assessment information to learners and instructors may also vary depending on their preferences and abilities for interpreting and responding to such information. In some cases, these factors may be adapted based on evaluations of their actual behaviors (e.g., a learner who disengages after seeing comparisons against peers). In other cases, users may have access to better information or more sophisticated responses than the adaptive system, and an appropriately personalized system would allow them to choose their action based on that information. Examples include a learner getting distracted upon trying to interpret very frequent feedback (with the system failing to distinguish loss of focus from intense, productive concentration), or an instructor recognizing when personal contact would help. Again, personalization builds in opportunities for human intervention to take over when the adaptive system is less suited for the task.

While these distinctions may not seem that significant, highlighting them here enables more precision in examining the criticisms of personalized and adaptive learning. Many limitations apply specifically to adaptive learning systems that do not leave enough room for individual choice or personal interaction. Adopting a fairly broad view allows us to focus on the possibilities and constraints regarding where these developments can go, not just the shortcomings of what some particular instantiations have done so far.

Why personalized learning and assessment?

Much of the recent buzz in educational technology and higher education has focused on issues of access, whether through online classes, open educational resources, or both (e.g., massive open online courses, or MOOCs). Yet access is only the beginning; other questions remain about outcomes (what to assess and how) and process (how to provide instruction that enables effective learning). Some anticipate that innovations in personalized learning and assessment will revolutionize both, while others question their effectiveness given broader constraints. The goal of this blog is to explore both the potential promises and pitfalls of personalized and adaptive learning and assessment, to better understand not just what they can do, but what they should do.