Tomorrow Ready · Evaluating AI Output
Subject adaptation · Years 9 to 10 · Learning Languages · Field-Based STEM · Tony Jones
AI does not produce one answer to a language task. It produces a range of plausible answers. The student who cannot say why one answer fits better than another has not yet exercised the disciplinary judgement the course is designed to build.
Same Prompt Two Outputs makes comparison the task. Two outputs from the same prompt, one annotated comparison, one justified preference: the justification is what is marked, not the chosen output. Evaluating language requires applying a criterion, and applying a criterion requires disciplinary knowledge.
Differences are identified in English. Justification is written in English. The teacher provides a short criterion list (register, audience fit, accuracy) for students to select from before annotating. The focus is on the habit of comparison, not target-language metalanguage.
Differences are identified partly in the target language where the student has the metalanguage. The student selects the criterion independently. Justification uses target-language vocabulary for linguistic concepts where the student has it.
A justification that says "Output A sounds more natural" without a reason is not sufficient. Return it with one prompt: "More natural for which audience, and what specific feature shows that?"
AI output can be grammatically accurate but culturally misaligned. Prompt students explicitly to consider cultural appropriateness as a criterion alongside accuracy, particularly for contexts involving formality, relationships, or social roles.