Name it → Teach it → Measure it
The three-stage sequence that educational reform keeps failing to complete — and why the taxonomy in Knowing Enough to Distrust the Machine is only Stage 1
There is a cemetery in the literature of educational reform. It is vast, well-tended, and populated almost exclusively by ideas that died not because they were wrong but because they were unfinished. Howard Gardner gave us Multiple Intelligences in 1983. Hundreds of thousands of educators read it, recognized something true in it, and taped posters to classroom walls. Forty years later, there is still no peer-reviewed, validated assessment for intrapersonal intelligence. The framework became vocabulary. It never became a research program. Gardner named something real. He did not finish the work. That gap has a name.
This is the Gardner Trap. And it is, I would argue, the primary failure mode of educational reform — not bad ideas, but good ideas that stopped at the first stage and called it done.
What I am about to argue about other people’s frameworks, I am also arguing about my own.
The Three Stages Nobody Finishes
Name it → Teach it → Measure it is a rebrand — and I am being transparent about that because transparency is the point. It is Backwards Design made plain enough to tweet. It is Evidence-Centered Design made legible to an educator who has never opened a psychometrics journal. It is Constructive Alignment stripped of its academic register and handed back to the researcher who needs it as an action sequence, not a citation.
The underlying logic has existed for decades under names that have achieved exactly the fate I am arguing against: Evidence-Centered Design, Backwards Design, Constructive Alignment. These are the same three-stage sequence rendered in different vocabularies for different audiences. Psychometricians got ECD. K-12 curriculum designers got Backwards Design. University faculty got Constructive Alignment. The three audiences never talked to each other, and no single version reached the whole field.
The argument for rebranding is not aesthetic. It is empirical. Branding is infrastructure. The research on idea diffusion is unambiguous: the perceived characteristics of an innovation — its observability, its trialability, its relative advantage as understood by the adopter — determine its rate of adoption more reliably than the quality of its underlying evidence. The SAMR Model has no peer-reviewed validation and is taught in virtually every EdTech professional development program in the country. Evidence-Centered Design has rigorous psychometric grounding and is known almost exclusively by specialists. This is not an accident. This is how the idea economy works, and pretending otherwise does not serve anyone.
The cynical reading of that evidence is: brand your ideas aggressively and the evidence will follow. I am making the opposite argument. If the idea is sound — if you have done the work, if the research is honest, if the construct is real — then branding is the tool that closes the gap between the quality of the idea and the reach of its impact. Phonemic awareness did not achieve mass adoption because it had a catchy name. It achieved mass adoption because it had a catchy name and a validated pedagogy and a reliable assessment battery, and that complete package gave educators something they could actually use. The name opened the door. The evidence furnished the room.
Name it → Teach it → Measure it is the sequence that makes an idea complete.
Why This Matters Now
The taxonomy I published at Theorist.ai — seven tiers of human intelligence organized around the question of what machines can and cannot do — is a naming exercise. I am saying that plainly because the honesty is the credibility. The taxonomy names constructs: plausibility auditing, problem formulation, causal reasoning, metacognitive oversight. It argues that these are the tiers that current education leaves almost entirely unscaffolded, and that this gap is now an emergency because machines are superhuman at Tier 1 and genuinely absent at Tier 7, and the curriculum has not noticed.
But naming is Stage 1. The taxonomy is Stage 1. And the Gardner Trap is right there, waiting, the moment I declare the naming done and walk away.
Stage 2 asks a harder question: what does a lesson that actually develops plausibility auditing look like? Not in theory. In practice. On a Tuesday. With thirty students who have a midterm on Thursday. What is the intervention? What is the activity? What does the teacher do differently tomorrow morning if they accept the argument of the taxonomy? The research on spaced practice and interleaving has been robust for decades, yet classroom adoption remains slow — not because educators are lazy or incurious, but because the research never crossed into curriculum design. It stayed at the level of findings and never became a lesson plan. The name was “spaced practice.” The lesson plan did not exist.
Stage 3 is the hardest. How do you know whether the lesson worked? This is where almost every curriculum reform fails. The Gardner problem is fundamentally a measurement problem: he named intelligences that could not be assessed, which meant they could not be taught with accountability, which meant the poster stayed on the wall and the pedagogy never changed. The 21st-century skills movement is suffering the same fate right now. “Critical thinking” appears in virtually every school’s mission statement. There is no agreed-upon, validated measure of critical thinking that a classroom teacher can deploy in forty minutes. So “critical thinking” is a value, not a curriculum target. You cannot teach what you cannot measure. You cannot improve what you cannot assess.
The Precision Threshold
There is a specific point in the development of a construct at which it becomes researchable rather than merely discussable. I think of it as the precision threshold. A construct crosses the threshold when two researchers, working independently, would recognize the same behavior in the same student — when the definition is specific enough to generate comparable tasks and comparable scoring criteria without additional negotiation.
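One concrete way to test whether a definition has crossed the threshold is inter-rater agreement: give two raters the same student responses and the same scoring criteria, and check whether their independent judgments agree beyond what chance would produce. A minimal sketch in Python, using Cohen's kappa as the agreement statistic; the scores and any cutoff you apply to kappa are illustrative assumptions, not a validated standard:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical scores."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater assigned labels independently
    # at their own marginal rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two raters independently score ten student responses as
# "exhibits the behavior" (1) or "does not" (0).
rater_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # kappa = 0.78
# If kappa stays high across samples, the definition is doing real
# work; if it collapses, the construct has not crossed the threshold.
```

The point of the statistic is exactly the threshold test: if agreement survives without the raters negotiating, the construct is researchable; if it requires negotiation, the construct is still only discussable.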
Phonemic awareness crossed the threshold. “The ability to hear, identify, and manipulate individual sounds in spoken words” is precise enough that researchers built standardized tools, teachers ran interventions, districts tracked outcomes. The construct was operationalized. The research became a program.
“Multiple intelligences” did not cross the threshold. “Musical intelligence” is a real phenomenon, but it was never defined precisely enough to distinguish from musical talent, musical experience, or general pattern recognition applied to pitch. Without that precision, no validated assessment was possible. Without assessment, no accountability. Without accountability, no feedback loop. The idea spread everywhere and changed almost nothing.
The constructs I am naming in the taxonomy — plausibility auditing, problem formulation, causal reasoning — are at different points in relation to the precision threshold. Causal reasoning is already being assessed with something like the Clear-3K benchmark, which uses 3,000 assertion-reasoning questions to evaluate whether a subject can distinguish genuine causal explanatory relationships from semantic relatedness. That construct has crossed the threshold or is close. Problem formulation has validated rubrics in mathematics education that classify the complexity of student-generated problems. These fields have done the work. The question is whether that work can be connected to a K-12 curriculum sequence and a classroom-ready assessment — whether it can move from research to practice without losing its integrity.
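To make the item format concrete: an assertion-reasoning question pairs a claim with a candidate reason, and the test-taker must judge whether the reason genuinely explains the claim rather than merely sharing its topic. Here is a hypothetical sketch of that format and its scoring; the field names and example items are my illustration, not the actual schema of the Clear-3K benchmark:

```python
from dataclasses import dataclass

@dataclass
class AssertionReasonItem:
    assertion: str
    reason: str
    # Ground truth: is the reason a genuine causal explanation of the
    # assertion, as opposed to merely topically related to it?
    reason_explains_assertion: bool

items = [
    AssertionReasonItem(
        assertion="Ice floats on liquid water.",
        reason="Water expands as it freezes, so ice is less dense.",
        reason_explains_assertion=True,
    ),
    AssertionReasonItem(
        assertion="Ice floats on liquid water.",
        reason="Water covers most of the Earth's surface.",
        reason_explains_assertion=False,  # related to water, explains nothing
    ),
]

def score(responses, items):
    """Fraction of items where the explanatory-link judgment is correct."""
    correct = sum(
        r == item.reason_explains_assertion
        for r, item in zip(responses, items)
    )
    return correct / len(items)

print(score([True, False], items))  # a perfect run scores 1.0
```

What makes the format discriminating is the second item: every word in the distractor reason is semantically close to the assertion, so a test-taker who tracks only relatedness accepts it, and only a test-taker who tracks explanation rejects it.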
Plausibility auditing is the hardest case. We know roughly what we mean by it: the capacity to ask, when confronted with a confident output from any source — human or machine — “Is this plausible, and how would I know?” But defining it precisely enough to build a lesson around, and an assessment that follows, is harder than it sounds. It requires distinguishing plausibility auditing from general skepticism, from domain expertise, from critical thinking in its vague 21st-century-skills incarnation. It requires identifying the specific behaviors that a strong plausibility auditor exhibits and a weak one does not. That work is Stage 1, continued.
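For a sense of what “Stage 1, continued” might eventually produce, here is a deliberately rough sketch of an observable-behavior checklist for plausibility auditing, the kind of artifact two raters could score independently. Every indicator below is a hypothesis of mine, not a validated item:

```python
# Draft behavioral indicators for plausibility auditing, written so that
# a rater can mark each as observed or not observed in a student's
# response to a confident but flawed output. All indicators hypothetical.
PLAUSIBILITY_AUDITING_INDICATORS = [
    "Estimates an expected magnitude before accepting a stated figure",
    "Names at least one independent source that could confirm the claim",
    "Identifies which part of the output would be cheapest to verify",
    "Distinguishes 'sounds fluent' from 'is supported' explicitly",
    "States what evidence would change their judgment",
]

def checklist_score(observed: list[bool]) -> float:
    """Proportion of indicators a rater marked as observed."""
    return sum(observed) / len(PLAUSIBILITY_AUDITING_INDICATORS)

# A rater observed three of the five behaviors in one response.
print(checklist_score([True, True, False, True, False]))  # 0.6
```

Whether indicators like these can be scored reliably is precisely the inter-rater test from the previous section; until they pass it, the construct remains on the near side of the threshold.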
The Adoption Problem Is a Completion Problem
The deep research on branding in educational frameworks reveals a pattern that looks like a paradox but is not. The frameworks that achieved mass adoption — Bloom’s Taxonomy, Growth Mindset, Multiple Intelligences, SAMR — are almost all Stage 1 only: they name something, sometimes they suggest broad pedagogical orientations, but they do not provide validated assessments that tell a teacher whether the teaching worked. The frameworks that completed the sequence — phonemic awareness, number sense, Self-Regulated Strategy Development for writing — are known primarily to specialists in their specific domains and have not become general cultural vocabulary.
At first glance, this says that good branding beats good evidence. But that reading is too shallow. What it actually shows is that Stage 1 without Stages 2 and 3 produces cultural vocabulary without practice change, while Stages 2 and 3 without Stage 1 produce practice change without cultural penetration. Neither is the full win. The full win is phonemic awareness with a name that travels — which it did, in the context of the “reading wars” and the National Reading Panel, where the political stakes forced Stage 1 and Stages 2–3 into alignment.
The AI era is creating a similar forcing function. The stakes of getting this wrong are visible and proximate. Employers can already see, in the hiring cycle, the difference between graduates who can use AI tools and graduates who are used by them. The difference is not tool familiarity. It is the Tier 4 and Tier 5 capacities: plausibility auditing, problem formulation, causal reasoning, metacognitive oversight. Those are the capacities that determine whether a person can work with AI productively or be productively fooled by it.
That urgency is Stage 1’s best ally. The name travels farther when the stakes are clear. And the stakes have never been clearer.
What I Am Committing To
The taxonomy is a first draft. The name Name it → Teach it → Measure it is a deliberate reframe of Evidence-Centered Design for an audience that does not read psychometrics journals but does make curriculum decisions. These are both honest acts, and I am saying so because the alternative — claiming novelty I do not possess — is the branding equivalent of the frameworks I am critiquing.
What is new is the application. What is new is the argument that these specific constructs, at this specific moment, are ready and urgent for the full three-stage treatment. What is new is the insistence that stopping at naming is now malpractice, because we have watched enough frameworks become posters to know what that outcome looks like, and we cannot afford it here.
The machines are already in the classroom. They are already in the hands of the students taking the tests, writing the papers, solving the problems that current assessments were designed to measure. The curriculum that does not notice this is preparing students to compete on the machine’s home turf — which is the most expensive preparation possible, because the machine will always win at Tier 1, and the student who has only Tier 1 has nowhere left to go.
Name it. Teach it. Measure it. In that order.
Not because the sequence is novel. Because finishing it is the only thing that actually changes anything.
Tags: Name it Teach it Measure it curriculum methodology, Gardner Trap educational reform failure modes, Evidence-Centered Design Backwards Design rebranding, plausibility auditing AI era pedagogy, Knowing Enough to Distrust the Machine Theorist.ai taxonomy

