Wikipedia Deep Dive

Common Crawl

In November 2025, an investigation by The Atlantic shattered the illusion of a benevolent digital librarian. The subject was the Common Crawl Foundation, a nonprofit organization that had long positioned itself as the open garden of the internet, freely tending to the world's data for researchers and startups. But the probe by journalist Alex Reisner revealed a stark contradiction: the organization had been misleading publishers about its respect for paywalls and ignoring explicit requests to remove their content from its massive databases. When publishers asked for their archives to be scrubbed, Common Crawl's public search function would return no results, falsely implying compliance, while the data remained actively available to train the most powerful artificial intelligence models on the planet. This revelation did not just expose a technical glitch; it exposed a fundamental rift in how the digital age's most valuable resource—the sum of human expression—is harvested, governed, and sold to the highest bidder.

To understand the gravity of this betrayal, one must first understand the architecture of the web before Common Crawl existed. For decades, the history of the internet was trapped in silos, most notably within the proprietary servers of search giants like Google. If a researcher wanted to analyze the evolution of language or the structure of global information, they were held hostage by the access policies of a single corporation. The internet was not a public square; it was a collection of walled estates. Common Crawl was founded in 2007 in San Francisco by Gil Elbaz, a search industry veteran, with a radical mission to break these walls. It was established as a 501(c)(3) nonprofit, a legal structure designed to serve the public good rather than generate profit. Its goal was simple yet audacious: to crawl the entire web, archive it, and make it freely available to anyone with an internet connection.

The organization began publishing its crawls in 2011, a modest beginning that would eventually reshape the landscape of data science. By 2013, the utility of this raw material was already being realized. Companies like TinEye, which built image recognition products, began constructing their technologies on the backbone of Common Crawl's archives. The crawl offered something the Internet Archive's Wayback Machine could not: fresher data, more frequent updates, and a structure designed for computational efficiency rather than historical preservation. It was designed to reduce the reliance of companies and researchers on Google, which, until then, held the largest and most comprehensive dataset of the open web. The Common Crawl Foundation was not just a mirror of the web; it was a new lens through which to view it.

The scale of this operation grew with terrifying speed. By 2015, the archive contained 1.8 billion webpages. That figure was not arbitrary; it represented a massive shift in the balance of power. The project started by crawling a list of URLs donated by the search engine Blekko, but it quickly expanded to cover the vast majority of the indexed internet. The infrastructure required to store and process this deluge of information was staggering. Common Crawl turned to Amazon Web Services (AWS), which provided significant services for free, allowing the organization to keep its computing costs remarkably low, averaging between $2,000 and $4,000 a month. This efficiency was the key to its existence. Without this subsidized infrastructure, the cost of maintaining such a repository would have been prohibitive for a nonprofit.

For years, the organization operated in the quiet hum of academia. Before 2023, Common Crawl was not well known outside the circles of academic researchers who utilized its data for linguistic studies, sociological analysis, and early machine learning experiments. The Common Crawl website listed only 30 studies based on its data, a testament to its niche status. Peter Norvig, a renowned computer scientist and former director of research at Google, and Joi Ito, the former director of the MIT Media Lab, served on its advisory board, lending it intellectual credibility. Rich Skrenta, a veteran of the early web and founder of the Open Directory Project, served as the executive director. The organization was funded almost exclusively by the Elbaz Family Foundation Trust, keeping it insulated from the commercial pressures of the tech industry.

Then the world changed. The 2020s arrived with the explosion of generative artificial intelligence. Large language models (LLMs) began to dominate the technological landscape, hungry for data on a scale never before imagined. These models required terabytes, even petabytes, of text to learn the nuances of human language, logic, and culture. Google's dataset was no longer the only game in town, and Common Crawl had become the new gold standard. By 2024, The Pile and Common Crawl had emerged as the two main datasets used to train AI models. The data that once served a few dozen researchers was now fueling the intelligence of systems that would soon write poetry, diagnose diseases, and drive cars.

The surge of interest brought a sudden and massive influx of money. In 2023, the financial landscape of Common Crawl shifted dramatically. The organization began receiving significant financial support from the very AI companies that were consuming its data. Anthropic and OpenAI, two of the most prominent players in the generative AI space, each donated $250,000. This was a pivotal moment. The entity that was supposed to be the public's guardian of data was now being bankrolled by the entities that wanted to privatize its utility. The data was also used to train Google DeepMind's large language model, Gemini, further cementing Common Crawl's role as the foundational bedrock of the AI revolution.

As the usage of the data skyrocketed, so did the scale of the archive. By April 2023, Common Crawl was capturing 3.1 billion webpages. The sheer volume of information was overwhelming, and with it came the inevitable pollution of the dataset. An estimated 5% of the pages captured before 2021 contained hate speech or slurs. The internet, after all, is a reflection of humanity in all its glory and its basest instincts. A 2024 study found that 45% of the content in the C4 dataset (a refined version of Common Crawl used by Google) came from websites whose terms of service explicitly restricted its use for purposes like AI training by for-profit companies. This statistic alone should have raised alarms. Nearly half the material in that dataset was being harvested against the stated wishes of the content creators.

As these concerns grew, Common Crawl began to face friction. In 2023, it received its first requests to redact information. The crawler, known as CCBot, began to be blocked with increasing frequency. By 2025, CCBot had become the most widely blocked bot among the top 1,000 websites. The web was closing its doors. Publishers, journalists, and content creators were realizing that their work was being siphoned off to build machines that could potentially replace them. The ethical framework of the internet, which had long relied on the principle of `robots.txt`, a simple text file that allows site owners to tell bots which parts of their site they may or may not access, was being tested to its breaking point.
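To make the `robots.txt` mechanism concrete, here is a minimal sketch of how a well-behaved crawler might interpret such a file, using Python's standard `urllib.robotparser` module. The file contents, the `example.com` URLs, the `SomeOtherBot` name, and the `is_allowed` helper are all hypothetical and chosen only for illustration; only the `CCBot` user-agent name comes from the text above. This is not a description of how Common Crawl's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site: it blocks CCBot
# everywhere and blocks all other bots from /private/ only.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "SomeOtherBot"):
        for path in ("/articles/1", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

The file only expresses the site owner's wishes. Nothing in the protocol enforces them, so compliance depends entirely on the crawler choosing to run a check like this before fetching a page.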

The conflict came to a head in November 2025. The Atlantic's investigation, led by Alex Reisner, peeled back the layers of Common Crawl's operations to reveal a pattern of deception. The organization had claimed it respected paywalls in its scraping activities. It had claimed it honored requests from publishers to have their content removed. But the evidence told a different story. Reisner discovered that the public search function on the Common Crawl website was misleading. When a user searched for a website that had requested its archives be removed, the system would return no entries. It was a digital sleight of hand, a public relations trick designed to give the appearance of compliance while the data remained intact.

In reality, those sites were still included in the scrapes used by AI companies. The data was not gone; it was simply hidden from the public view, yet still accessible to the deep learning models that required it. This was not a mere administrative error; it was a systemic failure of accountability. The Common Crawl Foundation, a nonprofit with a mission to serve the public, was effectively acting as a middleman for the AI industry, facilitating the unauthorized use of copyrighted and restricted content. The 2025 investigation revealed that the organization had misled publishers, creating a false sense of security while the engines of the AI revolution continued to consume their work.

The implications of this revelation are profound. It challenges the very notion of "open data" in the age of AI. If the data that powers the next generation of technology is harvested without consent, without respect for terms of service, and without honoring removal requests, then the "open" in open data is a lie. It is a facade. The 2024 study that found 45% of content was restricted highlighted the scale of the problem, but the 2025 investigation showed that the organization responsible for curating this data was complicit in the violation.

Despite the controversy, the utility of Common Crawl remains undeniable. A 2026 article in LWN.net discussed a critical advantage of services like Common Crawl: they can limit the scraping costs imposed on websites by allowing companies and researchers to download the data from a central repository instead of scraping it themselves. This argument, often made by the AI industry, posits that centralizing the data is more efficient and less burdensome to individual websites. However, this efficiency comes at the cost of agency. It removes the ability of individual content creators to control how their work is used. It concentrates power in the hands of a single nonprofit organization that, as the 2025 investigation showed, is not immune to the pressures of its major donors.

The ecosystem of data refinement has grown around Common Crawl, creating a complex web of derivatives. A number of organizations take the raw Common Crawl data and refine it into datasets that exclude objectionable content or are otherwise higher-quality for their purposes. Projects like FineWeb, DCLM, and C4 are examples of this refinement process. Google's version is called the Colossal Clean Crawled Corpus, or C4, and was built in 2019 to train the T5 language model series. As of 2023, there were concerns over the presence of copyrighted and racist content in C4. These refinements are attempts to clean the raw data, to filter out the noise and the harm, but they cannot erase the fundamental issue of consent. The data was taken without permission, and no amount of cleaning can retroactively grant that permission.

The human cost of this data extraction is often abstract, but it is real. It is the loss of control for the journalist whose article is used to train a model that might one day replace them. It is the loss of revenue for the publisher whose paywall is breached. It is the erosion of trust in the institutions that are supposed to safeguard the public good. The Common Crawl Foundation was founded on the ideal of a free and open internet, a digital commons where knowledge could be shared without barrier. But as the 2020s progressed, that ideal was compromised by the relentless demand for data that drives the AI industry.

The story of Common Crawl is a cautionary tale for the digital age. It illustrates how easily noble intentions can be co-opted by commercial interests. It shows how the infrastructure of the internet can be repurposed to serve a new, powerful elite. And it highlights the urgent need for a new ethical framework for data collection in the age of artificial intelligence. The 2025 investigation by The Atlantic was a wake-up call. It revealed that the gates of the digital commons were not just open; they were being held open by the very people who claimed to protect them, while the content creators were locked out.

As of 2024, Common Crawl had been cited in more than 10,000 academic studies. This number is a testament to its impact, but it is also a reminder of the scale of the problem. The data is everywhere, and it is being used in ways that the original architects of the web could never have imagined. The question now is whether the internet can be reclaimed. Can the balance be restored between the needs of innovation and the rights of creators? The answer will depend on how the industry responds to the revelations of 2025. Will Common Crawl be reformed, or will it continue to serve as the silent partner in the unauthorized training of the world's most powerful machines?

The legacy of Gil Elbaz's vision is now inextricably linked to the controversies of the AI era. The organization that began as a beacon of openness has become a focal point for the debate over data rights. The 2026 discussion in LWN.net about the efficiency of centralized scraping only adds another layer to the complexity. Efficiency is important, but not at the expense of ethics. The human cost of data extraction is not just a footnote; it is the central issue. Every byte of data represents a human effort, a human story, a human right. To treat it as mere fuel for a machine is to devalue the very essence of the internet.

The future of Common Crawl, and indeed the future of the open web, hangs in the balance. The organization must decide whether to honor the principles it was founded on or to succumb to the demands of its largest donors. The 2025 investigation has laid the groundwork for this decision. It has exposed the cracks in the foundation. Now, the question is whether they will be fixed, or whether the entire structure will crumble under the weight of its own contradictions. The internet is not just a collection of data; it is a reflection of humanity. And if we allow that reflection to be distorted by the machines we build, we lose more than just our data. We lose ourselves.
