The Comfortable Pessimist Revisited: Mitchell, Meaning, and the Tier We Won't Teach
Artificial Intelligence: A Guide for Thinking Humans
Melanie Mitchell’s Artificial Intelligence: A Guide for Thinking Humans is a scrupulously honest book about a field constitutionally prone to dishonesty about itself. Its central claim is correct: current AI systems are narrow, brittle, and operating without understanding. Its documentation of that claim is rigorous — adversarial examples, Winograd schemas, the blurry-background confound, machine translation that renders “what about the bill?” as “what about the proposed legislation?” Every time a tech company declares human parity on some benchmark, Mitchell opens the hood and shows you what was actually being measured and why the measurement flatters the machine. This is important work.
But there is a question Mitchell’s book raises and does not answer — a question she cannot quite bring herself to ask directly — and it matters more now than when she was writing. Not: when will AI reach human level? Not: should we fear the singularity? The question is this: if machines are genuinely poor at everything Mitchell identifies as important — plausibility auditing, causal reasoning, problem formulation, interpretive judgment — why are we not teaching those things? Why does the curriculum she implicitly defends remain untouched by the very analysis she provides?
Mitchell is an excellent diagnostician. She is, on the question of remediation, entirely silent.
The book opens with Douglas Hofstadter standing before a room of Google engineers in 2014, declaring himself terrified. Not of robots. Not of superintelligence. Terrified that human creativity might turn out to be “a bag of tricks.” A program called EMI had composed Chopin-like mazurkas that fooled professional musicians at the Eastman School of Music, and Hofstadter experienced this not as a technical curiosity but as a threat to his ontology — evidence that what he most cherished about human minds might be shallower than he had hoped.
The Google engineers were baffled. To them, AI progress was the goal. Hofstadter’s terror was unintelligible.
Mitchell spends the rest of the book adjudicating between these two responses, and she largely sides with Hofstadter — not in his terror, but in his insistence that something important is missing. EMI’s mazurkas were pattern manipulation. Deep Blue’s chess was brute-force search. AlphaGo’s divine moves emerged from millions of self-play games without AlphaGo ever knowing what a game was, what winning meant, or why any of it mattered. Mitchell’s argument is that these achievements, impressive as they are, do not constitute progress toward general intelligence, because general intelligence is not faster pattern matching. It is something else. She calls that something else understanding, grounds it in core intuitive knowledge, mental simulation, abstraction, and analogy, and concludes that current machines have essentially none of it.
This is correct. The Winograd schema results make it undeniable. A machine that scores 61% on problems that require knowing that containers have sizes, that things fall when dropped, that “until it was empty” specifies the bottle rather than the cup — that machine is not close to human-level language comprehension. On a two-choice task, 61% is barely better than guessing. It has approximated the syntactic surface of language without acquiring the semantic substrate. The gap is not one of scale. It is one of kind.
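The shape of that failure is easy to exhibit. Here is a minimal sketch of my own, not anything from Mitchell’s book: a baseline that resolves the pronoun by pure position, the kind of surface cue a statistical system can learn, applied to the schema pair described above.

```python
# A minimal illustration (mine, not from Mitchell's book) of why
# surface heuristics stall on Winograd schemas: the two sentences in
# a pair differ by one word, so any cue that ignores meaning answers
# both the same way -- and is therefore wrong on one of them.

SCHEMA_PAIR = [
    # (sentence, candidate referents, correct referent for "it")
    ("I poured water from the bottle into the cup until it was empty.",
     ["bottle", "cup"], "bottle"),
    ("I poured water from the bottle into the cup until it was full.",
     ["bottle", "cup"], "cup"),
]

def most_recent_noun(sentence: str, candidates: list) -> str:
    """A purely positional heuristic: pick whichever candidate appears
    last before the pronoun. It knows nothing about pouring."""
    return max(candidates, key=sentence.index)

for sentence, candidates, correct in SCHEMA_PAIR:
    guess = most_recent_noun(sentence, candidates)
    print(f"guess={guess!r}  correct={correct!r}")
# The heuristic says 'cup' both times: right once, wrong once, exactly
# chance, no matter how much text it was tuned on. Separating the pair
# requires knowing what emptying and filling do to containers.
```

Richer surface cues can climb somewhat above chance, which is plausibly what a score in the low sixties reflects; closing the rest of the distance requires the mental model of pouring that the schema was designed to demand.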
The book’s most analytically precise section is its treatment of what Mitchell calls the benchmark problem. The pattern recurs throughout AI, and Mitchell traces it with care: a useful task is defined narrowly, a benchmark is constructed for that task, a human baseline is established casually or from a handful of subjects, machine performance is measured under conditions that favor machines, the numbers converge, headlines declare parity, and the actual task — reading comprehension, visual recognition, language translation — remains unmastered. SQuAD required answer extraction from passages in which the answer was guaranteed to exist. ImageNet top-five accuracy allowed the machine five guesses. The “human” baseline on ImageNet came from a single graduate student who tested himself on 1,500 images and admitted to finding the process unenjoyable after the first 200. The Microsoft claim of “human parity” in Chinese-English translation rested on evaluations of single isolated sentences drawn from carefully edited news copy, not the colloquial, idiomatic, contextually entangled language that constitutes actual human communication.
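The top-five convention alone is worth pausing on. A toy calculation, with invented labels and a hypothetical classifier rather than anything from the actual ImageNet evaluations, shows how much the reported number depends on which convention you quote:

```python
# A toy illustration (invented data, not ImageNet) of how top-5
# accuracy flatters a classifier relative to top-1: the machine gets
# credit if the true label appears anywhere among its five guesses.

# Each row: (ranked guesses from a hypothetical classifier, true label)
predictions = [
    (["husky", "wolf", "malamute", "dingo", "coyote"], "wolf"),
    (["tabby", "tiger", "lynx", "leopard", "jaguar"], "lynx"),
    (["banjo", "guitar", "lute", "violin", "cello"], "banjo"),
    (["castle", "church", "monastery", "palace", "fort"], "barn"),
]

top1 = sum(guesses[0] == truth for guesses, truth in predictions)
top5 = sum(truth in guesses for guesses, truth in predictions)

print(f"top-1 accuracy: {top1 / len(predictions):.0%}")  # 25%
print(f"top-5 accuracy: {top5 / len(predictions):.0%}")  # 75%
# Same classifier, same guesses; the reported figure triples. A
# "human parity" headline quietly depends on which line gets quoted.
```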
Mitchell names this pattern without flinching. The naming is useful. What she does not do — what the book conspicuously avoids — is state the implication for education.
If machines are genuinely superhuman at the things the benchmark measures — pattern retrieval, syntactic manipulation, narrow classification — and genuinely poor at everything the benchmark cannot measure — judgment, interpretation, causal reasoning, the kind of understanding that answers Winograd schemas — then the education system that optimizes for benchmark performance is, in a quite precise sense, training humans to compete on the machine’s home turf. It is teaching students to be slower, more expensive versions of systems that already fit in their pockets. Mitchell’s own analysis establishes this. She draws no educational conclusion from it.
This is where the book’s intellectual comfort becomes philosophically evasive.
Consider what Mitchell identifies as the barriers to machine general intelligence: core intuitive knowledge, mental simulation, abstraction, analogy, causal reasoning, the ability to form new concepts on the fly. Her program Copycat — built on Hofstadter’s architecture of active symbols, designed to make analogies in idealized letter-string domains — could not solve problems that required recognizing a concept it had never seen. The concept of “double successorship.” The concept of “extra letters that need to be deleted.” Humans do this immediately, without instruction, because we are built — biologically and culturally — to form categories from sparse evidence, to perceive the essence of a situation before we can verbalize it, and to apply what we perceive to novel cases by analogy.
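The failure mode can be caricatured in a few lines of code. The sketch below is my toy, far cruder than Copycat’s architecture of active symbols, but it isolates the point: a fixed inventory of transformation rules handles exactly the analogies its author anticipated.

```python
# A toy letter-string analogizer (far cruder than Copycat). It owns a
# fixed inventory of rules; the interesting behavior is at the edge.

def advance_last(s: str) -> str:
    """Advance the final letter: 'abc' -> 'abd'."""
    return s[:-1] + chr(ord(s[-1]) + 1)

RULES = [
    advance_last,
    lambda s: s[::-1],      # reverse the string
    lambda s: s + s[-1],    # repeat the final letter
]

def solve(a: str, b: str, c: str):
    """If a changes to b, what does c change to? Try each known rule;
    give up if none of them explains a -> b."""
    for rule in RULES:
        if rule(a) == b:
            return rule(c)
    return None  # no available concept covers this transformation

print(solve("abc", "abd", "ijk"))     # 'ijl'    -- anticipated, solved
print(solve("abc", "abd", "mrrjjj"))  # 'mrrjjk' -- legal but blind
# A human looks at mrrjjj, sees groups of length 1, 2, 3, and answers
# 'mrrjjjj': successorship applied to group lengths, a concept the rule
# table cannot represent, let alone invent on the fly.
```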
These are precisely the capacities that are not on the test.
The standard curriculum optimizes for fact retrieval, arithmetic accuracy, and syntactic correctness in standardized formats. These are Tier 1 capacities — pattern matching, logical-mathematical manipulation, linguistic form. They are the capacities at which machines are now superhuman. The capacities Mitchell identifies as missing from machines — plausibility auditing, problem formulation, interpretive judgment, causal reasoning, the ability to recognize when a benchmark is measuring the wrong thing — are almost entirely unscaffolded by standard instruction. Students are not taught to ask whether the question is well-formed. They are taught to answer the question. They are not taught to audit the plausibility of a result without recomputing it. They are taught to compute. They are not taught to notice when a machine is responding to superficial statistical cues rather than semantic content — to recognize, in other words, the pattern of Clever Hans, the horse who appeared to calculate but was actually reading the questioner’s body language.
This is not a small gap. It is the entire gap. Mitchell has spent an entire book documenting what machines cannot do, and what machines cannot do is exactly what students are not taught to do. She does not notice the coincidence. Or she notices it and declines to follow it to its conclusion.
The book’s treatment of natural language is where the evasion is most costly.
Mitchell correctly observes that large language models — she is writing in 2019, before the current generation of systems — process language without understanding it. They are, in the terms I want to press, Tier 1 engines operating on Tier 1 data, producing Tier 1 outputs. They learn statistical distributions over token sequences. They do not know that hamburgers have sizes, that restaurants involve transactions, that “bent out of shape” is an idiom meaning upset. When Google Translate renders “a little too dark for my taste” into French variously as “infrequent” and “stooped over,” it is not making an error the way a careless translator makes an error. It is revealing that it was never doing what translation requires. It was pattern-matching across aligned corpora. Translation requires a mental model of the situation being described. The machine has no such model.
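It is worth seeing what “statistical distributions over token sequences” means at the smallest possible scale. The sketch below is a bigram counter over an invented corpus, a deliberately primitive stand-in for the vastly larger systems Mitchell discusses, but the epistemic situation is the same: tokens in, tokens out, nothing behind them.

```python
# The smallest possible "language model": bigram counts over a tiny
# invented corpus, continued greedily. No representation of hamburgers,
# restaurants, or moods exists anywhere in it.

from collections import Counter, defaultdict

corpus = ("the man went into the restaurant . "
          "the man ordered a hamburger . "
          "the hamburger was cold . the man was upset .").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def continue_text(word: str, length: int = 8) -> str:
    """Greedily emit the most frequent successor of each token."""
    out = [word]
    for _ in range(length):
        if not bigrams[out[-1]]:
            break
        out.append(bigrams[out[-1]].most_common(1)[0][0])
    return " ".join(out)

print(continue_text("the"))
# -> "the man went into the man went into the"
# Every step is locally plausible English; the whole is a loop. Nothing
# behind the counts knows what a restaurant visit is, and scaling the
# table up does not, by itself, put that knowledge in.
```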
This is exactly right. But here is what Mitchell does not say: the same analysis applies to students who have been trained to read for the answer rather than for the situation. A student who can locate a phrase in a paragraph that matches a question stem is doing what the SQuAD system does. They are performing answer extraction, not reading comprehension. The education system that produces SQuAD-style readers has, in a deep sense, been training students to approximate machine behavior before the machines arrived. Now the machines are better at it.
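The parallel is mechanical enough to write down. The sketch below is mine, not an actual SQuAD entry, and the real systems were neural models rather than word counters; but the task’s guarantee that the answer is a span of the passage is what lets a strategy of this shape go so far.

```python
# SQuAD-style "reading" reduced to its skeleton (a sketch, not a real
# system): score each sentence of the passage by word overlap with the
# question and return the best match. The benchmark guarantees the
# answer is in the passage, so matching substitutes for comprehension.

passage = ("The city council met on Tuesday. "
           "The council denied the demonstrators a permit. "
           "Officials cited a fear of violence at the march.")

def extract_answer(question: str, text: str) -> str:
    """Return the sentence sharing the most words with the question."""
    q_words = set(question.lower().strip("?").split())
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return max(sentences,
               key=lambda s: len(q_words & set(s.lower().split())))

print(extract_answer("Who denied the demonstrators a permit?", passage))
# -> "The council denied the demonstrators a permit"
# A correct "answer" produced by string matching alone; nothing here
# represents councils, permits, or fear. A student drilled to read for
# the matching phrase is running the same procedure by hand.
```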
The question Mitchell’s book demands but refuses to pose is: what would it mean to teach reading as the Winograd schema requires it? What would it mean to teach students to track reference — to know that “they feared violence” refers to the city council because of what councils do and what demonstrators want? To know that “until it was empty” specifies the bottle because of how pouring works in three dimensions? This is causal reasoning. It is Tier 5 intelligence. It is almost entirely absent from current AI and almost entirely absent from current curricula. The two absences are not coincidental. They reflect a common failure to understand what understanding requires.
There is one more silence in Mitchell’s book worth naming.
The epilogue ends with a gesture toward the embodiment hypothesis: the possibility that human intelligence cannot be separated from the body’s history of interaction with the world, that concepts are not abstractions stored in a symbol system but reenactments of sensorimotor experience, that to know “warmth” is to have been warm. Mitchell finds this “increasingly compelling.” She quotes Karpathy: perhaps the only way to build computers that interpret scenes the way we do is to give them structured, temporally coherent experience, the ability to interact with the world, and some magical active learning architecture that is barely imaginable.
This is the right intuition. But it points past machines. It points at education.
The student who learned mathematics by being asked to retrieve procedures is not the same as the student who learned mathematics by being asked to construct proofs, discover counterexamples, and explain why a result that looks right might be wrong. The latter student has a mental model of mathematical reasoning. The former has a lookup table. The distinction is not a matter of native intelligence. It is a matter of what was asked of them and what counted as success. It is a matter of curriculum.
Mitchell’s book has documented, rigorously and honestly, the gap between what machines do and what humans can do at their best. She has named the capacities on the far side of that gap. She has explained why they matter. What she has not done is turn the analysis around and ask what it would mean to build an education system that deliberately cultivated those capacities — that taught plausibility auditing as a discipline, that made causal formulation a first-order skill, that treated analogical reasoning not as a gift but as something that improves with practice and instruction.
The machines arrived. The question they force on us is not how to regulate them or fear them or celebrate them. It is simpler and more urgent: what are we going to teach now that they are here? Mitchell’s book contains everything necessary to answer that question. She declines to answer it.
That, in the end, is the limitation of the comfortable pessimist. She is right about everything that matters. She stops just short of what being right requires.
Tags: Melanie Mitchell AI critique, deep learning limits natural language understanding, Winograd schema causal reasoning education, Tier 4 plausibility auditing Tier 5 causal reasoning, benchmark problem machine learning curriculum, embodied cognition analogy making Hofstadter, theorist.ai
This piece is part of the ongoing argument at Theorist.ai — a dedicated home for the question of what education owes the next generation of thinkers, at the precise moment when machines have become genuinely good at answering questions and genuinely poor at knowing which questions are worth asking.

