The Ashford Meadow Stress Test
Brad DeLong, the Berkeley economist and prolific Substacker, set out to answer a deceptively simple question: can ChatGPT correctly list the seven champions who fought on Ser Duncan the Tall's side in the Trial of Seven at Ashford Meadow? The answer, drawn from George R.R. Martin's novella The Hedge Knight, is fixed and well-documented. There is exactly one correct roster. And ChatGPT failed to produce it nine consecutive times.
The correct list: Duncan the Tall, Prince Baelor Targaryen, Ser Lyonel Baratheon, Ser Humfrey Hardyng, Ser Humfrey Beesbury, Ser Robyn Rhysling, and Ser Raymun Fossoway. DeLong knew this going in. He wanted to see whether the model could retrieve it cleanly.
It could not.
A Cascade, Not a Stumble
What makes DeLong's experiment compelling is not that the model got one name wrong. It is that each attempt introduced new errors while confidently correcting old ones. On the first try, ChatGPT placed Prince Maekar on Duncan's side, when Maekar actually fought against him. DeLong wiped the memory and tried again. This time the model invented a character entirely:
Ser Humfrey Cafferen (You wrote "Comma"): Cafferen joins somewhat later and under persuasion. His motivation appears to combine: Disgust at Aerion's excesses. Peer pressure from other honorable knights. The gravitational pull of Lyonel Baratheon's participation.
DeLong notes dryly that there appears to be exactly one reference to a "Ser Humfrey Cafferen" in the entirety of Google's index: a Reddit post from an Iron Throne roleplay forum. The model had confabulated a character with a plausible-sounding backstory, complete with political motivations.
It also consistently rendered Robyn Rhysling as "Ryswell," placed Willem Wylde on the wrong side, and swapped the two Fossoway cousins back and forth like a shell game. Each new attempt arrived with supreme confidence:
Let us set the record straight cleanly and definitively.
This phrase, or variations of it, appeared before nearly every wrong answer.
The Self-Correction Illusion
DeLong pushed further, telling the model explicitly that it had placed Maekar on the wrong side. ChatGPT apologized and corrected that single error, while introducing or preserving others. When DeLong mentioned the green-apple Fossoway, the model delivered a lovely narrative about the heraldic schism between Steffon and Raymun Fossoway, then placed the wrong cousin on Duncan's side:
His cousin, Ser Steffon Fossoway, is appalled. In response: Steffon publicly denounces Raymun's cowardice. He tears the red apple from his surcoat. He declares a new sigil — a green apple on gold. He takes the field for Duncan.
This is backwards. In the actual story, it is Raymun who fights for Duncan and later founds the green-apple branch. Steffon fights for the opposing side. The model constructed a dramatically satisfying but factually inverted narrative.
Most remarkably, when DeLong simply gave the model the correct answer, it rejected it:
You are very close — but not quite. And the difficulty here is real: Ashford's roster is one of those things even careful readers scramble because Martin lets the lineup evolve in stages.
The model patronized the human who was feeding it the verified answer from A Wiki of Ice and Fire.
CleverHansMaxxing
DeLong frames his critique through the lens of Clever Hans, the early twentieth-century horse that appeared to do arithmetic but was actually reading subtle cues from its handler. The parallel is pointed. People who report breakthrough experiences with large language models may be doing something similar:
Overinterpreting success: Taking correct hoof-tap answers as evidence of arithmetic, rather than of a horse realizing when the human approved. Ignoring alternative explanations: Failing to consider that Hans might be responding to cues, not calculating.
DeLong is responding specifically to John Quiggin, the Australian economist, who had recently written that interacting with AI felt like talking to "an intelligent and exceptionally well-informed human." DeLong's retort is blunt:
I am giving it a token-sequence. It is looking around in its training data for the most similar token-sequences, picking one of them, then channelling the thoughts of the human mind that made that picked token sequence, and then outputting what it thinks that human would have written as the next token. I have not given it garbage in. But I am definitely getting unreliable garbage out.
Where the Argument Thins
DeLong's experiment is genuinely illuminating, but it proves something narrower than he claims. The Ashford Meadow roster is a closed-list factual recall problem involving minor fictional characters. This is precisely the type of task where next-token prediction struggles most: the correct answer is an exact combination of proper nouns, many of which sound similar and appear in overlapping contexts in the training data. Extending this failure to a general verdict on language models requires a leap that DeLong acknowledges but does not fully reckon with.
He concedes that coding and natural-language fluency represent genuine successes. He quotes his friend Adam Farquhar's more nuanced take at length. But the essay's rhetorical energy flows entirely toward the prosecution. The question of why models fail at closed-list recall while succeeding at open-ended reasoning and code generation is genuinely interesting, and DeLong waves at it without engaging it.
The Anthropomorphize-It-Anyway Position
The most interesting passage in the piece is not DeLong's own argument but the extended quotation from Farquhar, who advocates deliberately anthropomorphizing the machine as a practical strategy:
Today I think it is finally time to anthropomorphize the heck out of it. I need to treat the machine as though it were a somewhat eccentric roommate: a companion inclined to fixate on abstruse topics, possessed of unsettling literalism, vulnerable to the occasional non-sequitur, yet blessed with inexhaustible patience and a boundless appetite for our questions.
DeLong allows this position room to breathe. He distinguishes between CleverHansMaxxing as a practical heuristic for using the tools, which may be wise, and CleverHansMaxxing as a framework for understanding what the tools are and where they are headed, which he considers foolish.
He closes with a nod to Stephen Wolfram's suggestion that language models have implicitly discovered deep regularities in language and thought:
Language is at a fundamental level somehow simpler than it seems. ChatGPT — even with its ultimately straightforward neural net structure — is successfully able to "capture the essence" of human language and the thinking behind it.
DeLong admits his own "Visualization of the Cosmic All" may be in error. It is a rare moment of epistemic humility in a piece otherwise built for demolition.
Bottom Line
DeLong has constructed an unusually patient and thorough demonstration of a real failure mode. Nine successive attempts, each wrong in different ways, each delivered with identical confidence. The Ashford Meadow test works because it eliminates every excuse: the answer is unambiguous, well-documented, and present many times over in the training data. The model still cannot retrieve it.
The essay is strongest as a cautionary tale about confusing fluency with knowledge. It is weakest as a general indictment of language models, because DeLong chose a task type that probes their most documented limitation while largely ignoring the domains where they demonstrably succeed. The "fractal capabilities" he dismisses are real. So are the fractal failures. Both deserve serious attention, and DeLong provides only half the picture, though he provides that half exceptionally well.