
The anatomy of the least squares method, part two

Tivadar Danka makes a counterintuitive claim that will reshape how busy data practitioners approach their craft: the fastest route to mastering machine learning isn't wrestling with messy real-world datasets, but deliberately fabricating them. In a field often paralyzed by the tedious grind of data cleaning, Danka argues that simulation is not a shortcut, but a rigorous training ground for intuition that real data simply cannot provide.

The Case for Invented Reality

Danka reframes the standard workflow, suggesting that "finding, formatting, and processing data can be tedious and time-consuming," to the point where a practitioner might waste hours on a "barely usable dataset" when they could generate the exact scenario they need in minutes. This is a provocative stance for an industry obsessed with "big data," yet it holds up under scrutiny. By controlling the variables, the learner gains access to the one thing real-world analysis denies: ground truth.


The author writes, "Simulated data gives ground truth. With real data, you run an analysis and get something, but you have no way of knowing whether that result is real, a quirk in the dataset, or a bug in your code." This distinction is critical. When you simulate a relationship between variables—say, the number of punk rock concerts attended and a person's reported happiness—you know the exact mathematical parameters used to create the noise. This allows you to verify if your algorithm is actually working or just hallucinating patterns.
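
To make this concrete, here is a minimal sketch of the kind of simulation the post describes; the variable names, parameter values, and noise level are illustrative placeholders, not code taken from the post:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Ground-truth parameters chosen by us -- the "answer key" that
# real-world data never provides.
true_intercept = 2.0   # baseline happiness
true_slope = 0.5       # happiness gained per concert attended

n = 100
concerts = rng.uniform(0, 20, size=n)   # independent variable
noise = rng.normal(0, 1.5, size=n)      # simulated human variability
happiness = true_intercept + true_slope * concerts + noise
```

Because true_slope is known in advance, any estimate a fitted model produces can be checked against it directly.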

"Running experiments with simulated data helps you speed up that skill development like a racehorse on Red Bull."

Danka's analogy is vivid, but the substance behind it is what matters. He posits that understanding how an algorithm behaves under controlled stress—varying sample sizes, injecting specific noise levels, or introducing outliers—builds a mental model of the tool's limits. This is the difference between driving a car on a familiar commute and taking it to a test track to understand its braking distance. A counterargument worth considering is that real-world data contains "annoyances" and structural quirks that simulations rarely capture, potentially leaving a practitioner unprepared for the chaos of live deployment. Danka acknowledges this, noting that the "best way to master machine learning is to know how to use both simulated and real data."
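
A sketch of what such a "test track" experiment might look like in practice appears below; the specific noise levels and the use of numpy's polyfit are assumptions for illustration, not the post's actual code:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, true_intercept, true_slope = 100, 2.0, 0.5
x = rng.uniform(0, 20, size=n)

# Refit the same linear model at increasing noise levels and watch
# how far the estimated slope drifts from the known ground truth.
for noise_sd in [0.5, 1.5, 5.0, 15.0]:
    y = true_intercept + true_slope * x + rng.normal(0, noise_sd, size=n)
    slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
    print(f"noise sd {noise_sd:5.1f} -> estimated slope {slope_hat:.3f}")
```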

The Mechanics of Trust

The piece moves from philosophy to practice, demonstrating how to translate abstract equations into code that can generate as much data as needed. Danka walks through a specific example where he sets a "ground-truth relationship" between an independent variable and an outcome, then deliberately adds random noise to mimic human variability. He is careful to note that while the data is fake, the conclusions drawn about the method are real.

He emphasizes the importance of visualizing the results, specifically the residuals—the errors between what the model predicted and what actually happened. "The correlation between the predicted data and the residuals is always exactly zero," he explains, a mathematical certainty of the least-squares method. However, he warns that the shape of the residual plot matters more than the correlation number. If the plot looks like a funnel or a teardrop rather than an "amorphous cloud," the model is failing to capture a non-linear relationship.
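
That zero-correlation property can be verified numerically. Here is a minimal sketch assuming an ordinary least-squares fit with an intercept term (again illustrative code, not the post's):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
x = rng.uniform(0, 20, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, size=200)

# Least-squares fit using a design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

predicted = X @ beta
residuals = y - predicted

# Zero up to floating-point error: residuals of an intercept-containing
# least-squares fit are orthogonal to the fitted values.
print(np.corrcoef(predicted, residuals)[0, 1])
```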

"The goal of statistical modeling is to capture the key essence of the data, not to overfit every tiny variation."

This warning against overfitting is the article's most practical takeaway. Danka explains how the standard R-squared metric can be misleading because it artificially inflates when you add more variables, even if those variables are just random noise. To combat this, he advocates for the "adjusted R-squared," which penalizes the model for adding unnecessary complexity. By scaling the metric based on the number of parameters versus the number of observations, the adjusted score ensures that a model only improves if the new data actually adds predictive power.
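
The summary above paraphrases the mechanism; the post's exact expression is not reproduced here, but the standard textbook form of the adjusted R-squared, for n observations and p predictors, is 1 - (1 - R^2)(n - 1)/(n - p - 1). A minimal sketch:

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    for n observations and p predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# The same raw R^2 looks far less impressive once it has to pay
# for twenty predictors instead of one:
print(adjusted_r_squared(0.40, n=50, p=1))    # ~0.3875
print(adjusted_r_squared(0.40, n=50, p=20))   # ~-0.0138
```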

Critics might argue that relying too heavily on simulated data creates a false sense of security, where a model performs perfectly on clean, generated data but collapses when faced with the messy, biased reality of human behavior. Danka addresses this by promising a future post dedicated to the unique challenges of real data, suggesting a balanced curriculum rather than a replacement of one for the other.

Bottom Line

Danka's argument is a necessary correction to the "data hoarding" mentality, making a strong case that controlled experimentation builds deeper intuition than blind analysis ever could. While the risk of simulation is that it cannot replicate the full complexity of the real world, its value as a diagnostic tool for understanding model mechanics is undeniable. Readers should watch for the upcoming installment on real-data applications to see how these simulated intuitions hold up against the chaos of unstructured reality.

Sources

The anatomy of the least squares method, part two

by Tivadar Danka · The Palindrome

Hey! It’s Tivadar from The Palindrome.

Mike X Cohen, PhD, is here to continue our deep dive into the least squares method, the bread and butter of data science and machine learning.

Without further ado, I’ll pass the mic to him.

Enjoy!

Cheers,
Tivadar

By the end of this post series, you will be confident in understanding, applying, and interpreting regression models (general linear models) that are solved using the famous least-squares algorithm. Here’s a breakdown of the post series:

Post 1 (the previous post): Theory and math. If you haven’t read this post yet, please do so before reading this one!

Post 2 (this post): Explorations in simulations. You’ll learn how to simulate data to supercharge your intuition for least-squares, how to visualize the results, and how to run experiments. You’ll also learn about residuals and overfitting.

Post 3: Real-data examples. Simulated data are great because you have full control over the data characteristics and noise, but there’s no substitute for real data. And that’s what you’ll experience in this post. I’ll also teach you how to use the Python statsmodels library.

Post 4: Modeling GPT activations. This post will be fun and fascinating. We’ll dissect OpenAI’s LLM GPT-2, the precursor to its state-of-the-art ChatGPT. You’ll learn more about least-squares and also about LLM mechanisms.

Following along with code.

I’m a huge fan of learning math through coding. You can learn a lot of math with a bit of code.

That’s why I have Python notebook files that accompany my posts. The essential code bits are pasted directly into this post, but the complete code files, including all the code for visualization and additional explorations, are here on my GitHub.

If you’re more interested in the theory/concepts, then it’s completely fine to ignore the code and just read the post. But if you want a deeper level of understanding and intuition — and the tools to continue exploring and applying the analyses to your own projects — then I strongly encourage following along with the code while reading this post.

Here’s a video where I explain how to get my code from GitHub and follow along using Google Colab. It’s free (you need a Google account, but who doesn’t have one??) and runs in your browser, so you don’t need to install anything.

Why you should use simulated data when learning machine learning.

Here’s why I love teaching data science using ...