Balancing human-human and human-computer interaction

A fundamental challenge in implementing personalized learning is determining just how personal (or, more specifically, interpersonal) it should be. Carlo Rotella highlights the tension between the customization afforded by technology and the machine interface needed to collect the data supporting that customization. He homes in on the crux of the problem thus:

For data to work its magic, a student has to generate the necessary information by doing everything on the tablet.

That invites worries about overuse of technology interfering with attention management, sleep cycles, creativity, and social relationships.

One simple solution is to treat the technology as a tool that is secondary to the humans interacting around it, with expert human facilitators knowing when and how to turn the screens off and refocus attention on the people in the room. As with any tool, recognizing when it is hindering rather than helping will always remain a critical skill in using it effectively.

Yet navigating the human-to-data translation remains a tricky concern. In some cases, student data or expert observations can be coded and entered into the database manually, if worthwhile. Wearable technologies (e.g., Google Glass, Mio, e-textiles) seek to shorten the translation distance by integrating sensory input and feedback more seamlessly in the environment. Electronic paper, whiteboards, and digital pens provide alternate data capture methods through familiar writing tools. While these tools bring the technology closer to the human experience, they require more analysis to convert the raw data into manipulable form, and they further raise the question of whether the answer to too much technology is still more technology. Instructional designers will always need to weigh the costs and benefits to decide when intuitive human observation and reflection are superior, and when technology-enhanced aggregation and analysis are superior.

 

On the realistic use of teaching machines

From the perspective that all publicity is good publicity, the continued hype-and-backlash cycle in media representations of educational technology is helping to fuel interest in its potential use. However, misleading representations, even artistic or satirical ones, can skew the discourse away from realistic discussions of the true capacity and constraints of the technology and its appropriate use. We need honest appraisals of strengths and weaknesses to inform our judgment of what to do, and what not to do, when incorporating teaching machines into learning environments.

Adam Bessie and Arthur King’s cartoon depiction of the Automated Teaching Machine conveys dire warnings about the evils of technology based on several common misconceptions regarding its use. The first presents a false dichotomy between machine and teacher, portraying the goal of technology as replacing teachers through automation. While certain low-level tasks like marking multiple-choice questions can be automated, other aspects of teaching cannot. Even while advocating for greater use of automated assessment, I note that it is best used in conjunction with human judgment and interaction. Technology should augment what teachers can do, not replace it.

A second misconception is that educational programs are just Skinner machines that reinforce stimulus-response links. The very premise of cognitive science, and thus the foundation of modern cognitive tutors, is the need to go beyond observable behaviors to draw inferences about internal mental representations and processes. Adaptations to student performance are based on judgments about internal states, including not just knowledge but also motivation and affect.

A third misconception is that human presence corresponds to the quality of teaching and learning taking place. What matters is the quality of the interaction, between student and teacher, between student and peer, and between student and content. Human presence is a necessary precondition for human interaction, but it is neither a guarantee nor a perfect correlate of productive human interaction for learning.

Educational technology definitely needs critique, especially in the face of its possible widespread adoption. But those critiques should be based on the realities of its actual use and potential. How should the boundaries between human-human and human-computer interaction be navigated so that the activities mutually support each other? What kinds of representations and recommendations help teachers make effective use of assessment data? These are the kinds of questions we need to tackle in service of improving education.

 

Automating assessment: How, when, and why?

EdX, the most prominent nonprofit MOOC provider, plans to use and share automated software to grade and give feedback on student essays. On the heels of this announcement come legitimate skepticism about how well computers actually grade student work (i.e., “Can this be done?”) and understandable concern about whether this is a worthwhile direction for education to take (i.e., “Should this be done?”). Recasting these two questions in terms of how, when, and why to apply automated assessment yields a more critical framework for finding the right balance between machine-intelligent and human-intelligent assessment.

How?

When evaluating the limitations of artificial intelligence, I find it helpful to ask whether they can be classified as issues with data or algorithms. In some cases, available data simply weren’t included in the model, while in others, such data may be prohibitively difficult or expensive to capture. The algorithms contain the details of how data get transformed into predictions and recommendations. They codify what gets weighted more heavily, which factors are assumed to influence each other, and how much.

Limitations of data: Train on a broader set of sample student work as inputs

Todd Pettigrew describes some familiar examples of how student work might appear to merit one grade on the surface but another for content:

it is quite common to see essays that are superficially strong — good grammar, rich vocabulary — but lack any real insight… Similarly some very strong essays—with striking originality and deep insight—have a surprising number of technical errors that would likely lead a computer algorithm to conclude it was bad.

This highlights the need to train the model on these edge cases to distinguish between style and substance, and to ensure that it does not raise false alarms on spurious features. Elijah Mayfield points out that a training set of only 100 hand-graded essays is inadequate; the distinction between style and substance is just one example of the kind of information such a small sample could fail to capture.

Limitations of data: Include data from beyond the assignment and the course

Another relevant concern is whether the essay simply paraphrased an idea from another source, or if it included an original contribution. Again from Todd Pettigrew:

the computer cannot possibly know how the students answers have related to what was done elsewhere in the course. Did a student’s answer present an original idea? Or did it just rehash what the prof said in class?

Including other information presented in the course would allow the model to recognize low-level rehashing; adding information from external sources could help situate the essay’s ideas relative to other ideas. A compendium of previously-expressed ideas could also be labeled as normative (consistent with the target concepts to be learned) or non-normative (such as common misconceptions), to better approximate the distance between the “new” idea and “old-but-useful” ideas or “old-but-not-so-useful” ideas. But confirming whether that potentially new idea is a worthwhile insight, a personal digression, or a flawed claim is probably still best left to the human expert, until we have better models for evaluating innovation.
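As a rough illustration of situating an essay's ideas relative to previously expressed ones, the sketch below compares an essay against a small labeled compendium using word-count cosine similarity. This is only one simple way to approximate "distance" between ideas, not a description of any existing grading system; the function names, the sample compendium, and its labels are all invented for illustration.

```python
import math
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase words after stripping surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_idea(essay: str, compendium: list[tuple[str, str]]) -> tuple[str, float]:
    """Return the label and similarity of the compendium entry nearest the essay."""
    essay_counts = word_counts(essay)
    scored = [
        (cosine_similarity(essay_counts, word_counts(text)), label)
        for label, text in compendium
    ]
    score, label = max(scored)
    return label, score


# Hypothetical compendium: each entry is (label, previously expressed idea).
COMPENDIUM = [
    ("normative", "Plants make sugar from sunlight, water and carbon dioxide."),
    ("non-normative", "Plants get their food from the soil."),
]

label, score = closest_idea(
    "Plants get food from the soil through their roots.", COMPENDIUM
)
```

A high similarity to a course source would suggest rehashing, and a high similarity to a non-normative entry would suggest a common misconception. A low similarity to everything only signals that the idea may be new; judging whether it is worthwhile would still fall to the human expert.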

Limitations of data: Optimizing along the wrong output parameters

Scores that were dashed off by harried, overworked graders provide a poor standard for training an AI system. More fundamentally, the essay grade itself is not the goal; it is only a proxy for what we believe the goals of education should be. Robust assessment relies on multiple measures collected over time, across contexts, and corroborated by different raters. If we value long-term retention, transfer, and future learning potential, then our assessment metrics and models should include those.

I recognize that researchers are simply using the data that are most readily available and that have the most face value. My own work sought to predict end-of-course grades as a preliminary proof of concept because that’s the information we consistently have and use, and our society (perhaps grudgingly) accepts that. Ideally, I would prefer different assessment data. To point such innovations in the right direction, we should ultimately identify better data (through educators, assessment experts, and learning scientists), make them readily available (through policymakers and data architects), and demand their incorporation in the algorithms and tools used (through data analysts, machine learning specialists, and developers).

Limitations of algorithms: Model for meaning

Predictive or not, features such as essay length, sophistication of vocabulary, sentence complexity, and use of punctuation are typically not the most critical determinants of essay quality. What we care about is content, which demands modeling the conceptual domain. Hierarchical topic models can map the relative conceptual sophistication of an essay, tracking the depth and novelty of a student’s writing. While a simple semantic “bag-of-words” model ignores word order and proximity, a purely syntactic model accepts grammatical gibberish. A combined semantic-syntactic model can capture not just word co-occurrence patterns, but higher-order relations between words and concepts, as evident in sentence and document structure. Compared to the approaches earning such public rebuke now, more sophisticated algorithms exist, although they need more testing on better data.
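To make the bag-of-words limitation concrete, here is a minimal sketch showing that two sentences with opposite meanings produce identical word counts, while even a crude order-aware representation (adjacent word pairs) tells them apart. The helper names are invented for illustration, and word pairs are only a stand-in for the richer semantic-syntactic models described above.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def bag_of_words(text: str) -> Counter:
    """Word counts only: word order and proximity are discarded."""
    return Counter(tokenize(text))


def word_pairs(text: str) -> Counter:
    """Counts of adjacent word pairs, which keep some local word order."""
    words = tokenize(text)
    return Counter(zip(words, words[1:]))


a = "The teacher corrected the student."
b = "The student corrected the teacher."

same_bag = bag_of_words(a) == bag_of_words(b)    # True: same words, same counts
same_pairs = word_pairs(a) == word_pairs(b)      # False: the order differs
```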

When?

The question here is which parts of the assessment process are best kept “personalized,” and which parts are best made “adaptive.”

Rapid, automated feedback is useful only if the information can actually be used productively before the manual feedback would have arrived. For a student whose self-assessment is wide of the mark, an immediate grade can offer reassurance or a kick in the pants. For others, it may enable doing just enough to get by. Idealist instructors might shudder at the notion, but students juggling competing demands on their time might welcome the guidance. How well students can make sense of the feedback will depend on its specificity, understandability, actionability, and perceived cost-benefit calculus, all open questions in need of further iteration.

For instructors, rapid feedback can provide a snapshot of aggregate patterns that might otherwise take them hours, days, or longer to develop. Beyond simply highlighting averages which an expert instructor could already have predicted, such snapshots could cluster similar essays that should be read together to ensure consistency of grading. They could flag unusual ideas in individual essays or unexpected patterns across multiple essays for closer attention. Student work could be aggregated for pattern analysis within an individual class, across the history of each student, or across multiple instances of the same class over time. Some forms of contextualization may be overwhelming or even undesirably biasing, while others can promote greater fairness and enable deeper analysis. Determining which information is most worthwhile for an instructor to know during the grading process is thus another important open question.
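One way such a snapshot might group similar essays for side-by-side reading is sketched below, using Jaccard similarity over word sets and a simple greedy pass. The threshold, the function names, and the greedy strategy are assumptions chosen for brevity; a real tool would likely use a richer representation of content.

```python
def word_set(text: str) -> frozenset[str]:
    """The distinct lowercase words in a text, punctuation stripped."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return frozenset(w for w in words if w)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Shared words divided by total distinct words (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similar(essays: dict[str, str], threshold: float = 0.5) -> list[list[str]]:
    """Greedily place each essay in the first group whose seed essay it resembles."""
    groups: list[tuple[frozenset[str], list[str]]] = []
    for essay_id, text in essays.items():
        words = word_set(text)
        for seed_words, members in groups:
            if jaccard(words, seed_words) >= threshold:
                members.append(essay_id)
                break
        else:
            # No existing group is close enough, so this essay seeds a new one.
            groups.append((words, [essay_id]))
    return [members for _, members in groups]
```

Groups with several members are candidates for reading together to keep grading consistent, while single-member groups could be flagged as unusual essays that deserve closer attention.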

Both cases explicitly acknowledge the role of the person (either the student or the instructor) in considering how they may interact with the information given to them. An adaptive system would provide first-pass feedback for each user to integrate, but the instructor would still retain responsibility for evaluating student work and developing more sophisticated feedback on it, with both instructor and student continuing the conversation from there. That continued conversation is the personalized component.

Essay-grading itself may not be the best application of this technology. It may be more aptly framed as a late-stage writing coach for the student, or an early-stage grading assistant for the instructor. It may be more useful when applied to a larger body of a given student’s work than to an individual assignment. Or it may be more effective for both student and instructor when applied to an online discussion, outlining emerging trends and concerns, highlighting glaring gaps, helping hasty writers revise before submitting, and alerting facilitators when and where to intervene. While scaling up assessment is an acknowledged “pain point” throughout the educational enterprise, automation may fulfill only some of those needs, with other innovations taking over the rest.

Why?

The purpose of technology should be to augment the human experience, not to replace or shortchange it, and education, especially writing, is fundamentally about connecting to other people. Many of the objections to automated essay grading reflect these beliefs, even if not explicitly stated. People question whether automation can capture something which goes far beyond that essay alone: not just the student’s longer learning trajectory, but the sense of a conversation between two people that extends over time, the participation in a meaningful interpersonal relationship. Whether our current instantiation of higher education meets this ideal is not the point. Rather, in a world where we can design technology to meet goals of our own choosing, and in which good design is a time-consuming and labor-intensive process, we should align those expensive technologies with worthwhile goals.

By these standards, any assessment of writing which robs the student and author of these extended conversations fundamentally fails. Jane Robbins claims that students need “the guidance of experts with depth and breadth in the field at hand”, a teacher who can also be “mentor, coach, prodder, supervisor.” As the students on the Brown Daily Herald’s editorial board argue:

an evaluation of an essay by a professor is just as important, if not more, to a student’s scholarship and writing. The ability to sit down and discuss the particularities of an essay with another well-informed and logical human is an essential part of the essay writing experience.

Coupled with arguments that other types of (machine-graded) assessment are better for evaluating content knowledge or even low-level critical thinking, these arguments raise the question: Why try to automate assessment of writing at all? After all, much of what I have advocated here is simply accelerating the assessment process, not truly automating it.

The most basic reason is simply that writing instruction is important, and students need ongoing practice and feedback to continue improving. To the extent that any assessment feedback can be effectively automated, it can help support this goal. That raises two additional questions: How deep must feedback be for a writing exercise to be worthwhile? More controversially, can writing that never sees a real audience still serve a legitimate pedagogical purpose?

Considering the benefits not just of actively retrieving and generating information, but also of organizing one’s thoughts into coherent expression, I would argue that some writing exercises can facilitate learning even without an audience. Less clear is how far that collection of “some” stretches, or what the specific parameters are which demand feedback from an expert human. Likely factors include more complex assignments, more extreme work quality, weaker feelings of student belonging, longer intervals between receiving human feedback, less sophisticated automated feedback, and more nuanced expert feedback. Better articulating these limits and anticipating what we can gain and lose will help guide future development and application of automated assessment.
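To show how some of these factors might eventually be articulated as explicit limits, here is a deliberately simple sketch that flags a submission for expert human feedback. It covers only three of the factors listed above; the field names, scales, and thresholds are hypothetical placeholders rather than empirically grounded values.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    complexity: int                  # hypothetical 1-5 rating of the assignment
    auto_score: float                # automated score on an assumed 0-1 scale
    days_since_human_feedback: int   # time since the student last heard from a person


def reasons_for_human_feedback(
    sub: Submission,
    max_complexity: int = 3,         # assumed cutoff, not an empirical value
    extreme_margin: float = 0.15,    # assumed width of the "extreme quality" bands
    max_gap_days: int = 21,          # assumed longest acceptable gap
) -> list[str]:
    """List which illustrative limits a submission exceeds (empty if none)."""
    reasons = []
    if sub.complexity > max_complexity:
        reasons.append("complex assignment")
    if sub.auto_score <= extreme_margin or sub.auto_score >= 1 - extreme_margin:
        reasons.append("extreme work quality")
    if sub.days_since_human_feedback > max_gap_days:
        reasons.append("long interval since human feedback")
    return reasons
```

For example, `reasons_for_human_feedback(Submission(5, 0.95, 30))` returns all three reasons, while a routine mid-range submission returns an empty list and could rely on automated feedback alone. Returning reasons rather than a bare yes or no keeps the rule inspectable by the instructor who has to act on it.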

Repositioning personalized and adaptive learning

Amidst all the excitement about personalized and adaptive learning is a lot of confusion about what the terms mean. In this post, I will examine some of the commonly-used definitions around personalized learning and explore the relationships between the ideas to clarify how I intend to use the terms here. My next post will explore the differences between personalized and adaptive assessment.

Education Growth Advisors describes personalization as “moving beyond a one-size-fits-all approach to instruction,” in which students might receive additional work as challenge or remediation, depending on their current performance. In their view, adaptivity is a more specialized form of personalization which “takes a more sophisticated, data-driven, and, in some cases, non-linear approach to remediation.” Some examples might use past performance to adjust the focus, timing, or path of content delivered to a learner.

In contrast, EdSurge places personalized learning at the top of its hierarchy with this definition: “When instruction is truly geared to the student: learning objectives, content, presentation methods and pace may all vary depending on the learner.” By their definitions, adaptive learning varies the content, individualized learning is self-paced, and differentiated learning varies the content and presentation methods, all three being lesser versions of personalization.

While both groups classify adaptive learning as a subcategory of personalized learning, they differ in where they place that subcategory along the personalized-learning spectrum, with Education Growth Advisors deeming it more sophisticated and EdSurge considering it less sophisticated compared to other forms of personalized learning. My definition will likewise situate adaptive learning as being narrower than personalized learning, but without making any claims as to which is superior in either sophistication or effectiveness. In being broader than adaptive learning, personalized learning can be both better and worse. The difference between them depends on what is being tailored to the individual, not how well it is executed.

EdSurge restricts adaptivity to selection of content; my definition will also encompass adaptivity of learning objectives, presentation methods, focus, timing, and path through content. (This is akin to Education Growth Advisors’ definition of adaptivity and closer to EdSurge’s definition of personalization.) The reasoning is that the learning environment may be tailored along all of these dimensions according to individual needs.

In my definition, personalized learning includes other features beyond adaptive learning to make the learning experience more personal—quite simply, involving the person. These may include individual preferences, chosen by the user; instructor actions, informed by a faculty dashboard offering analytics and recommendations but with decisions left up to the teacher; and social interaction, facilitated by collaborative tools and engineered to encourage productive learning, but again leaving room for human input. While adaptive learning relies on data-driven decisions from machine intelligence to tailor the experience to the learner, personalized learning also adds an explicit role for judgments made by human intelligence.

A fully personalized learning ecosystem may allocate some decisions to be determined by machine intelligence and delivered through its automated adaptive learning system, designating others to be handled by human intelligence and provided by various actors in the ecosystem. The adaptive component optimizes the environment, to the extent that it can control it, along parameters which its data and algorithms predict will be beneficial. The personalized components make allowances where human intervention is deemed to be more valuable (such as user choice, yet-to-be-modeled human expertise, and social interaction). Still, the unpredictability of human behavior also means that those components cannot claim to be truly adaptive, only personalized.

Thus, personalized learning has the potential to be much more sophisticated and powerful than adaptive learning, if realized effectively. Early iterations of personalization may not take full advantage of these possibilities, and in some cases, human decision-making may end up being maladaptive. Our goal is to help clarify those dimensions and parameters to enable better design and use of personalized learning.