
AI literacy - lecture 7.1: Diffusion models - text-to-image generative AI

Kenny Easwaran delivers a crucial correction to the popular narrative that AI image generators are simply "magic" or artistic geniuses. Instead, he argues that these systems are fundamentally limited by their training data, which prioritizes broad visual impressions over the specific, high-frequency details humans care about most. This is not just a technical breakdown; it is a lesson in why the AI's "hallucinations" are actually a predictable feature of how it sees the world compared to us.

The Short History of a Long Dream

Easwaran begins by dismantling the myth that generative AI is a sudden, overnight phenomenon. He notes that while the public exploded over a viral image of a pope in a puffy jacket in March 2023, the technology had been maturing quietly for years. "The history is surprisingly short for both of them," Easwaran says, pointing out that the foundational models were less than a year and a half old when they hit the mainstream. He traces the lineage back to the 1950s, distinguishing modern generative tools from the computer graphics of the 1980s and 90s used in films like Terminator 2 or Toy Story. Those earlier tools were "working on behalf of humans," whereas modern systems attempt to generate cultural products autonomously.


This distinction is vital for understanding the current landscape. The shift from symbolic AI, which tried to explicitly program rules for object recognition, to neural networks, which learn patterns from data, is where the magic—and the mess—begins. Easwaran highlights the 2012 breakthrough of AlexNet, which proved that neural nets could outperform all previous methods in image classification. But the leap from classifying an image to creating one required a different architecture entirely. He describes the evolution of Generative Adversarial Networks, where two neural nets engage in an "arms race" to fool each other, eventually leading to the diffusion models that dominate today.
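Easwaran doesn't walk through the mechanics, but the "diffusion" in diffusion models refers to a concrete procedure: the network is trained to reverse a fixed process that gradually corrupts an image with Gaussian noise. A minimal NumPy sketch of that forward (noising) step, using a standard DDPM-style variance schedule (the function name and schedule values here are illustrative, not taken from the lecture):

```python
import numpy as np

def forward_diffusion(x0, t, betas, seed=0):
    """Noise a clean image x0 up to timestep t in one shot.

    Uses the closed form q(x_t | x_0):
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    where alpha_bar_t is the cumulative product of (1 - beta) up to t.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = np.random.default_rng(seed).normal(size=x0.shape)  # Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Toy "image": an 8x8 grid of ones.
x0 = np.ones((8, 8))
# Linear variance schedule over 1000 steps (common DDPM-style choice).
betas = np.linspace(1e-4, 0.02, 1000)

almost_clean = forward_diffusion(x0, t=0, betas=betas)    # barely perturbed
pure_noise = forward_diffusion(x0, t=999, betas=betas)    # signal nearly gone
```

Training then amounts to teaching a network to predict and subtract that noise step by step; generation runs the learned reversal starting from pure noise. This is why the model excels at overall impressions but has no built-in mechanism for exact constraints like finger counts.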

"Throughout most of the history of artificial intelligence, people have been thinking in terms of computers processing information, identifying things, making recommendations, not so much making works of art."

This framing effectively grounds the reader. It prevents the panic or awe that often accompanies AI news by reminding us that this is a specific, recent branch of a much older tree. The argument holds up well: the technology is powerful, but it is not omnipotent, and its limitations are rooted in its specific developmental path.

The Blind Spot of the Machine

The core of Easwaran's analysis focuses on the most glaring failures of current models: hands and text. He presents a compelling case that these aren't random glitches but structural blind spots. "The AI is just trained on a large amount of data, just a bunch of images of all different sorts," he explains. "While we humans care particularly specially about human body parts and text, the AI doesn't really treat that as anything more or less special than any other sort of image that it looks at."

He uses a fascinating analogy involving a 1770 European painting of a kangaroo. To the painter, who had never seen a kangaroo, the animal looked like a cross between a mouse and a deer. Similarly, to the AI, human hands are just another object in the dataset, not a biological imperative with a fixed number of fingers. "To the AI, that's what humans are," Easwaran argues. "We're just some animal that shows up in its pictures, and it gets the details roughly right as far as it can tell."

This insight reframes the "uncanny valley" not as a failure of the AI to be human, but as a failure of the AI to prioritize human-specific details. The training data itself is biased by human behavior. Easwaran points out that in most photographs, hands and text are peripheral; they are rarely the main subject. "Since the hands and text aren't very often in photographs, they don't generally make up the majority of a photograph," he notes. Consequently, the AI never learns to count fingers or spell words with the precision a human artist would. It learns the impression of a hand, not the anatomy.

"Human artists have also known for ages that drawing hands is remarkably difficult... and so this is one of the very important sections of an art class is just practicing actually looking at a hand and seeing what it looks like to draw it."

Critics might argue that as models scale and training data becomes more curated, these issues will simply vanish. However, Easwaran's point about the fundamental nature of neural networks—that they are "much better at getting impressions and patterns rather than things like counting or getting diagrams right"—suggests a deeper architectural hurdle. Without symbolic reasoning built in, the AI may always struggle with tasks requiring exact logic, like counting fence posts or rendering a readable floor plan.

The Limits of Pattern Recognition

Easwaran concludes by examining the gap between the AI's output and human intent. He shows examples of real estate diagrams generated by Midjourney that look convincing from a distance but fall apart upon inspection. The rooms don't connect, and the doors don't make sense. "If you know how to look at a real estate diagram, you see that these make no sense at all," he observes. The AI captures the aesthetic of a diagram without understanding the function of a building.

This is the critical takeaway for anyone relying on these tools. The system is a master of style and a novice of substance. It can mimic the look of a 1960s California suburb or a French apartment with startling accuracy, but it cannot be trusted to get the details right. "These neural nets don't have that built in," Easwaran says regarding the ability to count or reason logically. The result is a tool that is incredibly useful for brainstorming and mood boarding but dangerous if mistaken for a source of factual or structural truth.

"The AI is just trained on a large amount of data... while we humans care particularly specially about human body parts and text, the AI doesn't really treat that as anything more or less special than any other sort of image."

This observation is the piece's strongest asset. It moves the conversation beyond "Is the AI good?" to "What exactly is the AI good at?" By identifying that the AI's "blind spots" are actually a reflection of the data it was fed, Easwaran provides a clear framework for understanding the technology's future limitations.

Bottom Line

Easwaran's lecture is a necessary antidote to the hype cycle, grounding the excitement of generative AI in the reality of its training data and architectural constraints. The strongest part of his argument is the reframing of AI errors not as bugs, but as logical consequences of how the machine perceives the world compared to humans. The biggest vulnerability remains the rapid pace of development; while his analysis of current models is sound, the industry is actively working to patch these specific holes, meaning the "blind spots" he identifies may shrink faster than the historical context suggests. Readers should watch for how the industry addresses the fundamental lack of symbolic reasoning in these systems, as that is the true barrier to reliable utility.

Sources

AI literacy - lecture 7.1: Diffusion models - text-to-image generative AI

by Kenny Easwaran

In this lecture I'll talk about diffusion models, which are the most common current form of AI text-to-image generation right now. I'll discuss some of the history of how they emerged and some of the kinds of images that they have trouble with, as well as some current concerns for them, but I won't have time to go into too much detail of how they work or even about any of these issues.

So in March of 2023, a now-famous AI-generated picture of Pope Francis wearing a puffy jacket went viral, and many people were inspired to produce various responses to it. Either this or a few other images around the same time were the moment at which AI image generation became extremely well known to the general public. However, it was actually the year before, in 2022, that AI text-to-image generation first became easily accessible to the general public. OpenAI's system DALL-E 2 was released to a few testers in April, then to a million or so in July, and made fully open to everyone in September. The competitor Midjourney made its version two model open to the general public in July, with version three following soon after, and Stability AI released its open-source model Stable Diffusion in August of that year. This was all a few months before the conversational large language model ChatGPT, based on GPT-3.5, went public, which itself brought massive additional attention to generative AI.

Now, the fact that these were version two of DALL-E and Midjourney shows that these models had actually been in development somewhat earlier than 2022, but the history is surprisingly short for both of them: version one was less than a year and a half before this. Interestingly, up until very recently, AI was primarily imagined either as a tool for classifying things, or making recommendations, or doing some sort of interactive behavior. The idea that an AI would be used to generate an image or another cultural product like a text wasn't really how people were thinking. So throughout most of the history of artificial intelligence, people have been thinking in terms of computers processing information, identifying things, making recommendations, not so much making works of art. Early computer graphics in the 1980s and '90s was not usually thought of as part of artificial ...