Explore LLM word representations using similarity analysis

Tivadar Danka peels back the curtain on the black box of artificial intelligence, revealing that the internal mechanics of language models are far more structured than their chaotic outputs suggest. While most observers marvel at what these systems say, Danka focuses on how they think, using a technique called Representational Similarity Analysis to show that distinct sets of activations within a neural network converge on similar semantic structure. This is not just a tutorial on code; it is a rigorous demonstration that even when individual components appear uncorrelated, the collective architecture learns elegant, meaningful representations of the world.

The Architecture of Attention

Danka begins by dissecting the core mechanism of the transformer block: the attention algorithm. He explains that for every token processed, the model generates three distinct sets of activations—Query (Q), Keys (K), and Values (V). "The idea is this: For each pair of tokens, the dot product between their corresponding Q and K vectors creates a scalar weighting value, with higher dot products indicating that more importance (attention) should be paid to that pair," Danka writes. This weighting then dictates how much information from the V vectors is integrated into the final prediction.
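The weighting described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the code from Danka's notebook: the scaling by the square root of the vector size, the softmax, and the causal mask are standard transformer details that the quoted passage does not spell out, and the function name `causal_attention` and the random inputs are assumptions made for the example.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (n_tokens, d), one vector per token.
    """
    n_tokens, d = Q.shape
    # Dot product of every query with every key: one scalar per token pair.
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: a token may attend only to itself and earlier tokens.
    future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over keys turns the scores into weights that sum to 1 per row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output row is a weighted mix of the value vectors.
    return weights @ V, weights

# Random stand-ins for real activations (assumption: 4 tokens, 8 dimensions).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
output, weights = causal_attention(Q, K, V)
```

Because of the mask, the first token can attend only to itself, so the first row of `output` is simply the first row of `V`.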


This framing is crucial because it shifts the reader's perspective from viewing the model as a static database to seeing it as a dynamic processor of context. Danka notes that these vectors are not merely storing data but are actively "rotated and scaled by each transformer layer," nudging the representation of a word based on everything that came before it. The author's decision to ignore the specific "attention heads" for this analysis is a strategic choice to isolate the fundamental mechanics, even though he admits that a per-head analysis could reveal even more granular details.

"These results show that although individual attention matrices have idiosyncratic internal calculations, they learn meaningful representations and words that allow them to interact in elegant and semantically meaningful ways."

The elegance of this argument lies in its counter-intuitive finding: the raw activation values of Q, K, and V are largely uncorrelated. Danka visualizes this with scatter plots showing near-zero correlations between the matrices. "The correlation between Q and K is weakly negative, while the other two pairs have correlations near zero," he observes. This might suggest a lack of coordination, but Danka argues the opposite. The lack of direct correlation is actually a feature, not a bug, allowing the system to maintain distinct computational roles while still achieving a unified understanding.

The Power of Representational Similarity

To show that these distinct vectors are working together, Danka introduces Representational Similarity Analysis (RSA). This method does not compare the raw numbers directly; instead, it compares the relationships between the numbers. "An RSA (representational similarity analysis) works by comparing cosine similarity matrices across different embeddings spaces," Danka explains. The core insight is that even if two systems use different coordinate systems or dimensionalities, their internal structures can be judged similar when the relative similarities between concepts are strongly correlated across the two.
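As a rough sketch of that recipe, assuming a Pearson correlation over the unique word pairs as the final comparison (Part 1 of the series defines the exact score, which is not reproduced here): build a cosine similarity matrix for each space, then correlate their upper triangles. The helper names and the random data are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (one row per word)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

def rsa_score(A, B):
    """Correlate the unique off-diagonal similarities of two embeddings spaces."""
    sim_a = cosine_similarity_matrix(A)
    sim_b = cosine_similarity_matrix(B)
    # The matrices are symmetric and the diagonal is always 1,
    # so only the upper triangle carries information.
    iu = np.triu_indices(A.shape[0], k=1)
    return np.corrcoef(sim_a[iu], sim_b[iu])[0, 1]

rng = np.random.default_rng(0)
space_a = rng.standard_normal((10, 64))    # 10 words in a 64-dimensional space
projection = rng.standard_normal((64, 32))
space_b = space_a @ projection             # the same 10 words in 32 dimensions
unrelated = rng.standard_normal((10, 32))  # 10 unrelated vectors, for comparison

print(f"projected copy:    {rsa_score(space_a, space_b):.2f}")
print(f"unrelated vectors: {rsa_score(space_a, unrelated):.2f}")
```

Note that `rsa_score` never compares vectors across spaces directly, which is why the two inputs may have different dimensionalities as long as they describe the same words in the same order.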

This approach echoes the word2vec work from 2013, which popularized the idea that words could be mapped to vectors in which semantic relationships (like king minus man plus woman approximately equals queen) are preserved mathematically. Danka extends this logic to the deep layers of modern models, showing that the "hidden layers" are not just noise but highly organized spaces. By creating binary matrix masks to isolate within-category similarities (like "galaxy-comet") versus across-category ones (like "star-sofa"), he demonstrates that the model consistently groups semantically related words, regardless of which vector (Q, K, or V) is being examined.
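One way to build such masks is sketched below. The word list and the category labels "space" and "furniture" are hypothetical (only the pairs galaxy-comet and star-sofa come from the article), and random vectors stand in for real model activations, so the printed means carry no meaning here.

```python
import numpy as np

# Hypothetical words and category labels, for illustration only.
words = ["galaxy", "comet", "star", "sofa", "chair", "table"]
categories = np.array(["space", "space", "space",
                       "furniture", "furniture", "furniture"])

# Random vectors stand in for real Q, K, or V activations.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((len(words), 16))
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

# Boolean masks over the word-by-word similarity matrix.
same_category = categories[:, None] == categories[None, :]
upper = np.triu(np.ones_like(same_category, dtype=bool), k=1)  # unique pairs only
within_mask = same_category & upper    # e.g. galaxy-comet
across_mask = ~same_category & upper   # e.g. star-sofa

within_mean = similarity[within_mask].mean()
across_mean = similarity[across_mask].mean()
print(f"within-category mean similarity: {within_mean:.3f}")
print(f"across-category mean similarity: {across_mean:.3f}")
```

With real activations, a within-category mean clearly above the across-category mean would indicate the category grouping the article describes.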

"The primary goal of this post series is to teach you the Representational Similarity Analysis (RSA), which is a machine-learning analysis that is used to compare distributed representations in different systems."

The experimental design here is particularly robust. Danka acknowledges that language models struggle with isolated words, often producing "unusual and outlier-like activations." To solve this, he prompts the model with the sentence "The next word is ___" for each target word. "This is good experimental design because it means that all words have identical context, and thus any differences and similarities can only be attributed to world-knowledge that the model learned about each word," he argues. This control ensures that the observed patterns are due to the model's internal knowledge, not the quirks of the input sequence.

Interpreting the Hidden Layers

The most striking revelation in Danka's analysis is the disconnect between direct correlation and structural similarity. While the raw values of the Q, K, and V vectors move independently, their similarity structures are highly consistent. Danka writes, "You will discover that the adjustment vectors are largely uncorrelated while their RSA scores are quite high." This suggests that the model uses different mathematical tools to arrive at the same semantic conclusion, a form of redundancy that likely contributes to the robustness of these systems.
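A synthetic example helps show why the two measures can come apart. It is not a reproduction of Danka's result: here the second space is simply a random rotation of the first, an assumption chosen because it makes the effect exact. The raw values end up nearly uncorrelated, yet every pairwise cosine similarity is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 20, 64
space_a = rng.standard_normal((n_words, dim))

# Random orthogonal matrix: a change of coordinate system.
# The sign fix on the QR factors makes the rotation uniformly random.
q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
rotation = q * np.sign(np.diag(r))
space_b = space_a @ rotation

def pairwise_cosine(X):
    """Cosine similarities for all unique pairs of rows in X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T
    return sim[np.triu_indices(len(X), k=1)]

# Direct comparison of the raw numbers, element by element.
raw_corr = np.corrcoef(space_a.ravel(), space_b.ravel())[0, 1]
# Comparison of the similarity structure (the RSA idea).
rsa = np.corrcoef(pairwise_cosine(space_a), pairwise_cosine(space_b))[0, 1]

print(f"raw correlation: {raw_corr:.3f}")
print(f"RSA score:       {rsa:.3f}")
```

Real Q, K, and V activations are not rotations of one another, so their RSA scores will not be perfect; the point is only that a low raw correlation says little about how similar the underlying structure is.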

A counterargument worth considering is whether this structural consistency implies true understanding or merely sophisticated pattern matching. Critics might argue that high RSA scores simply reflect the statistical regularities of the training data rather than a genuine grasp of semantics. However, Danka's focus on the relational nature of the data—how words relate to each other rather than their absolute values—offers a compelling middle ground. It suggests that the model has internalized a map of concepts that mirrors human categorization, even if the underlying mechanics are alien.

"The result of the attention algorithm is an adjustment to the embeddings vectors as they pass through the transformer stack... those adjustments nudge the vectors from pointing towards the tokens in the input... towards other tokens to generate an appropriate and context-relevant output."

Danka's work also highlights the practical challenges of interpretability. Accessing these internal calculations requires "hook functions" in PyTorch to capture data before it is destroyed, a process that generates massive amounts of data. "Even small LLMs create huge data matrices during each forward pass, and storing all of those internal calculations for each prompt would require terabytes of space," he notes. This technical hurdle underscores why such deep dives are rare; they require not just theoretical insight but significant computational engineering.
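The hook mechanism itself is small. The sketch below uses PyTorch's `register_forward_hook` on a toy module rather than a real LLM: the class `ToyAttentionProjection`, its layer names, and the choice to keep only the final token's activations are all assumptions made for illustration, not details taken from Danka's code.

```python
import torch
import torch.nn as nn

class ToyAttentionProjection(nn.Module):
    """Hypothetical stand-in for one attention sublayer (not a real LLM)."""
    def __init__(self, d_model=16):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)  # produces Q, K, V side by side
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return self.out(torch.softmax(scores, dim=-1) @ v)

captured = {}

def save_qkv(module, inputs, output):
    # Runs right after self.qkv computes its output, before it is discarded.
    # Keeping only the final token is one way to limit storage (an assumption).
    q, k, v = output.detach()[:, -1, :].chunk(3, dim=-1)
    captured["Q"], captured["K"], captured["V"] = q, k, v

model = ToyAttentionProjection()
handle = model.qkv.register_forward_hook(save_qkv)

with torch.no_grad():
    model(torch.randn(2, 5, 16))  # 2 sequences of 5 tokens each

handle.remove()  # detach the hook once the activations are captured
print({name: tuple(t.shape) for name, t in captured.items()})
# prints {'Q': (2, 16), 'K': (2, 16), 'V': (2, 16)}
```

Storing a small slice inside the hook, rather than the full activation tensor, is what keeps this kind of analysis from requiring the terabytes of space mentioned above.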

Bottom Line

Tivadar Danka's analysis successfully demystifies the attention mechanism by showing that the apparent chaos of raw activations masks a highly ordered, semantically rich internal structure. The piece's greatest strength is its methodological rigor, using RSA to reveal consistency where direct comparison fails, though it leaves open the philosophical question of whether this structural mimicry constitutes true understanding. For anyone seeking to move beyond the hype of what these models say to understand how they think, this is essential reading.

Explore LLM word representations using similarity analysis

by Tivadar Danka · The Palindrome

What you will learn in this 2-part post series.

The primary goal of this post series is to teach you the Representational Similarity Analysis (RSA), which is a machine-learning analysis that is used to compare distributed representations in different systems.

If you haven’t already read Part 1 in this series, please do so! It provides necessary background about how the RSA score is calculated and interpreted.

As a brief reminder, an RSA (representational similarity analysis) works by comparing cosine similarity matrices across different embeddings spaces (layers, blocks, models, etc.). The idea is that different embeddings spaces may have distinct coordinate systems and even different dimensionalities, but if their internal representational structures are similar, the relative similarities should be strongly correlated even if the vectors are distinct.

The additional goals of this second post are (1) to learn more about RSA and category specificity, and (2) to learn how to dissect the “hidden layers” of an LLM, and in particular, the Query, Keys, and Values vectors inside the transformer block. Those Q, K, and V vectors are part of the mechanism by which LLMs figure out what information from previous words is relevant for using the current word to make predictions about subsequent words.

You will discover that the adjustment vectors are largely uncorrelated while their RSA scores are quite high. These results show that although individual attention matrices have idiosyncratic internal calculations, they learn meaningful representations and words that allow them to interact in elegant and semantically meaningful ways.

If you want to learn more about the attention algorithm, I can humbly recommend my post on the topic.

This post roughly corresponds to Project 38 in my recent book on using machine-learning projects to understand how LLMs work. Don’t worry, you don’t need the book to follow this post.

How to use the code with these posts.

The accompanying code file will reproduce all the figures in this post — but you can do so much more by thinking of the code as a starting-point for your continued explorations. Try changing parameters, adding new words or categories, using different similarity/distance metrics, different models, etc.

The code is available here on my GitHub. In the video below, I show how to get and run the code using Google Colab. You can also download the notebook file and run it locally, but I recommend using Colab because you won’t need to worry ...
