AI Companies Access Paywalled Articles via Common Crawl Archive

Common Crawl, a non-profit organization, scrapes billions of webpages to build a massive archive of the web. Major tech companies such as OpenAI and Google have used this archive to train their language models. However, Common Crawl appears to be quietly giving these companies access to paywalled articles from top news websites.

The foundation says researchers use its collection for purposes such as studying book banning and analyzing online forums. In 2012, its founder said the archive should be used “in the right way,” in a manner that respects copyright. Instead, Common Crawl appears to be letting tech giants train their models on copyrighted content without permission.

This has raised concerns about intellectual property rights and about whether the foundation’s claims are misleading publishers.

Source: https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567