The anatomy of the least squares method, part four

Tivadar Danka dismantles the myth that large language models are impenetrable black boxes, arguing instead that their internal mechanics are accessible to anyone with a high-school math background. By dissecting the GPT-2 architecture through the lens of the least squares method, Danka transforms abstract neural network theory into a tangible, code-driven investigation. This is not merely a tutorial; it is a democratization of artificial intelligence, showing that the "magic" of generative AI is actually just linear algebra in motion.

Demystifying the Machine

Danka begins by challenging the intimidation factor that often surrounds deep learning. "If you think LLMs are so complicated that they are impossible to understand, then I have bad news for you… you're wrong!" he writes. This bold opening sets the stage for a technical deep dive that refuses to treat the reader as a novice, yet remains grounded in fundamental principles. The author's approach is to strip away the mystique of the transformer model, revealing it as a series of manageable mathematical operations.

The piece focuses on the "attention" mechanism, the core engine that allows these models to weigh the importance of different words in a sentence. Danka explains that this subblock "analyzes the text you've given to the model... and find[s] pairs of words that are important." He illustrates this by noting that while a pair like ["subblock", "focus"] carries semantic weight, a pair like ["the", "of"] does not. This distinction is crucial for understanding how the model generates context-aware text.

To make this concrete, Danka guides the reader through importing the GPT-2 model using Hugging Face's libraries. He describes the model's architecture, noting that it contains 12 transformer blocks, each housing an attention subblock and a multi-layer perceptron. The complexity of the code output—listing parameter matrices and layer norms—might seem daunting, but Danka reframes it as a map of the model's internal landscape. "The architecture and algorithms are the same, but GPT2 is smaller," he notes, positioning it as the ideal educational tool, much like the MNIST dataset is for computer vision.
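For readers following along, a minimal sketch of that import step might look like this (the variable names are illustrative, not the author's exact code):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Download the pretrained GPT-2 weights and its tokenizer from Hugging Face
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Printing the model reveals the structure described above: 12 transformer
    # blocks (model.transformer.h), each with an attention subblock and an MLP,
    # plus the layer norms and parameter matrices
    print(model)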

"Seeing math come alive in code gives you a deeper understanding and intuition — and that warm fuzzy feeling of confidence in your newly harnessed coding and machine-learning skills."

This sentiment underscores the article's pedagogical strength: it prioritizes intuition over rote memorization. By encouraging readers to run the code themselves, Danka ensures that the concepts stick. The methodology relies on "hook functions," a technical feature in PyTorch that allows external inspection of a model's hidden layers during a forward pass. This technique is the key that unlocks the black box, allowing the author to extract the specific "attention adjustment vectors" that the model generates for each token.
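To make the hook technique concrete, here is a minimal sketch, assuming we want to capture the output of each attention subblock during a forward pass (the prompt and storage scheme are illustrative, not the author's exact code):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    attention_outputs = {}  # maps layer name -> attention adjustment tensor

    def make_hook(name):
        # A forward hook receives (module, inputs, output) each time the
        # module runs; we stash the attention subblock's output tensor
        def hook(module, inputs, output):
            attention_outputs[name] = output[0].detach()
        return hook

    # Register one hook per transformer block (L0 through L11)
    for i, block in enumerate(model.transformer.h):
        block.attn.register_forward_hook(make_hook(f"L{i}"))

    prompt = "Hungarian rock music has a long and colorful history."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)  # the hooks fire here, filling attention_outputs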

From Tokens to Vectors

The article then pivots to the mechanics of tokenization, the process of converting human-readable text into integers that the model can process. Danka highlights the efficiency of this system, noting that a paragraph about Hungarian rock music containing 428 characters is compressed into just 95 tokens. This compression is achieved through the byte-pair encoding algorithm, a method that segments language based on statistical co-occurrences. The connection to the publication's broader deep dive on byte-pair encoding adds a layer of historical depth here, reminding readers that the efficiency of modern AI is built on decades of statistical linguistics research.
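The compression ratio is easy to check yourself; a quick sketch (with an illustrative sentence rather than the article's exact paragraph):

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    text = "Hungarian rock music has a history stretching back to the 1960s."
    token_ids = tokenizer.encode(text)

    print(f"{len(text)} characters -> {len(token_ids)} tokens")
    # Byte-pair encoding merges frequently co-occurring character sequences,
    # so common words become single tokens while rare words split into pieces
    print(tokenizer.convert_ids_to_tokens(token_ids))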

Danka's analysis of the resulting data is rigorous. He calculates the "norm" of the adjustment vectors—a measure of their magnitude—to simplify the complex 768-dimensional data into a single, analyzable number. "The smaller the norm, the smaller the adjustment to the embeddings vector," he explains. This reduction is essential for applying the least squares method, which seeks to find the best-fitting line through a set of data points.
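Continuing the hook sketch above, this reduction is a one-liner in PyTorch (attention_outputs carries over from that sketch; its tensors have shape (1, num_tokens, 768)):

    # Collapse each 768-dimensional adjustment vector to its L2 norm,
    # leaving one scalar per token per layer
    norms = {name: out.squeeze(0).norm(dim=-1)
             for name, out in attention_outputs.items()}

    print(norms["L0"].shape)  # one adjustment magnitude per token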

The visualization of this data reveals a fascinating pattern. The first layer of the model shows large, clustered adjustments as it shifts from raw word embeddings to context-aware representations. "The initial embeddings reflect the words themselves and not their context, and L0 is the first opportunity for the model to adjust the vectors according to the specific context of the prompt text," Danka writes. As the data moves through deeper layers, the vectors shift from representing the input to predicting the output. This progression is visualized through scatter plots and histograms, which Danka notes are "suitable for a least-squares regression analysis."
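A sketch of one such visualization, continuing from the norms computed above (matplotlib is an assumed choice here; the article's exact plots may differ):

    import matplotlib.pyplot as plt

    # Compare per-token adjustment norms in the first and last layers
    for name in ["L0", "L11"]:
        plt.plot(norms[name].numpy(), "o-", label=name)

    plt.xlabel("Token position")
    plt.ylabel("Attention adjustment norm")
    plt.legend()
    plt.show()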

Critics might argue that focusing on GPT-2, an older and smaller model, limits the applicability of these findings to state-of-the-art systems. However, Danka anticipates this by emphasizing that the underlying architecture remains consistent across generations. The principles of attention and least squares regression are universal to the transformer family, making the insights from GPT-2 highly relevant to understanding modern systems.

The Regression of Thought

The climax of the piece is the application of the least squares method to predict attention adjustments. Danka sets up a regression model to test a specific hypothesis: "whether the attention adjustment to the current token can be predicted from the attention adjustment to the two preceding tokens." This question probes the temporal dependencies within the model's processing: if the model makes large adjustments to the preceding tokens, does it tend to make a similarly large adjustment to the current one?

The design matrix is constructed to test this, ignoring the first token as an outlier and focusing on the relationship between adjacent tokens. Danka's code demonstrates how to build this matrix and fit the model, turning a theoretical question into an empirical result. The use of the Freedman-Diaconis guideline to select histogram bins shows a commitment to statistical rigor, ensuring that the visualizations accurately reflect the underlying data distribution.
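As a sketch of what that design matrix might look like, using NumPy's least-squares solver rather than the author's exact code (y_all reuses the L0 norms computed earlier):

    import numpy as np

    y_all = norms["L0"].numpy()       # per-token adjustment norms from above

    # Skip token 0 as an outlier; predict token t from tokens t-1 and t-2
    y = y_all[3:]                     # current token
    X = np.column_stack([
        np.ones_like(y),              # intercept column
        y_all[2:-1],                  # adjustment to the previous token
        y_all[1:-2],                  # adjustment to the token before that
    ])

    beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    print("Least-squares coefficients:", beta)

    # Freedman-Diaconis guideline for histogram bin edges, as mentioned above
    bin_edges = np.histogram_bin_edges(y, bins="fd")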

"You can learn a lot of math with a bit of code."

This simple statement encapsulates the article's philosophy. Danka does not just teach the math; he teaches the application of math. By grounding the abstract concept of least squares in the concrete reality of a language model's internal state, he makes the mathematics feel less like a chore and more like a tool for discovery. The connection to the publication's other deep dives, such as the one on attention mechanisms in machine learning, reinforces the idea that these concepts are interconnected parts of a larger whole.

The analysis reveals that the distribution of attention norms is "fairly normal," with a central peak and decaying counts on either side. This finding matters because ordinary least squares inference assumes normally distributed errors; roughly normal data does not guarantee that assumption holds, but it is a reassuring sign that the model is appropriate. Danka's careful inspection of the data before modeling demonstrates a best practice in data science: always visualize your data before you analyze it.

Bottom Line

Tivadar Danka's piece is a masterclass in technical communication, successfully bridging the gap between high-level theory and practical implementation. Its strongest asset is the refusal to treat large language models as magical artifacts, instead revealing them as systems governed by understandable mathematical laws. The biggest vulnerability lies in the inherent complexity of the code, which may still deter readers without a programming background, despite the author's assurances of accessibility. For the busy professional, this article offers a vital takeaway: the future of AI is not a mystery to be feared, but a mechanism to be understood, one least squares regression at a time.

Deep Dives

Explore these related deep dives:

  • Rock music in Hungary

    Linked in the article (9 min read)

  • Byte-pair encoding

    The article specifically mentions the byte-pair-encoding algorithm as the tokenization method used by GPT models, explaining how it segments language based on statistical co-occurrences. Understanding this compression algorithm would give readers deeper insight into how LLMs process text.

  • Attention (machine learning)

    The article focuses heavily on the attention mechanism as 'the heart and soul of a language model' and uses regression to analyze attention adjustments. This Wikipedia article would provide the theoretical foundation for understanding how attention calculates word-pair importance.

Sources

The anatomy of the least squares method, part four

by Tivadar Danka · The Palindrome

Hey! It’s Tivadar from The Palindrome.

The legendary Mike X Cohen, PhD is back with the final part of our deep dive into the least squares method, the bread and butter of data science and machine learning.

Enjoy!

Cheers,
Tivadar

By the end of this post series, you will be confident about understanding, applying, and interpreting regression models (general linear models) that are solved using the famous least-squares algorithm. Here’s a breakdown of the post series:

Part 1: Theory and math. If you haven’t read this post yet, please do so!

Part 2: Explorations in simulations. You learned how to simulate and visualize data and regression results.

Part 3: Real-data examples. Here you learned how to import, inspect, clean, and analyze a real-world dataset using the statsmodels, pandas, and seaborn libraries.

Part 4 (this post): Modeling GPT activations. We'll dissect OpenAI's LLM GPT2, the precursor to its state-of-the-art ChatGPT. You'll learn more about least-squares and also about LLM mechanisms.

Following along with code.

Seeing math come alive in code gives you a deeper understanding and intuition — and that warm fuzzy feeling of confidence in your newly harnessed coding and machine-learning skills. You can learn a lot of math with a bit of code.

Here is the link to the online code on my GitHub for this post. I recommend following along with the code as you read this post.

The Palindrome breaks down advanced math and machine learning concepts with visuals that make everything click. Join the premium tier to get access to the upcoming live courses on Neural Networks from Scratch and Mathematics of Machine Learning.

Import and inspect the GPT2 model.

A large language model (LLM) is a deep-learning model trained to take text as input and predict what text should come next. It's a form of generative AI because it uses context and learned worldview information to generate new text.
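As a quick illustration of that next-token prediction, here is a minimal sketch using the transformers pipeline API (the prompt is illustrative):

    from transformers import pipeline

    # Load GPT-2 as a text-generation pipeline and let it continue a prompt
    generator = pipeline("text-generation", model="gpt2")
    result = generator("The least squares method is", max_new_tokens=12)
    print(result[0]["generated_text"])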

If you think LLMs are so complicated that they are impossible to understand, then I have bad news for you… you're wrong! LLMs are not so complicated, and you can learn all about them with just a high-school-level math background. If you'd like to use Python to learn how LLMs are designed and how they work, you can check out my 6-part series on using machine learning to understand LLM mechanisms here on Substack.

There are two goals of this post: (1) to show you how easy ...